Abstract. We consider a high-dimensional regression model with a possible change-point
due to a covariate threshold and develop the Lasso estimator of regression coefficients as
well as the threshold parameter. Under a sparsity assumption, we derive nonasymptotic
oracle inequalities for both the prediction risk and the $\ell_1$ estimation loss for regression
coefficients. Furthermore, we establish conditions under which the unknown threshold
parameter can be estimated at a rate of nearly $n^{-1}$ when the number of regressors can be
much larger than the sample size ($n$). We illustrate the usefulness of our proposed
estimation method via Monte Carlo simulations and an application to real data.

1. Introduction

The Lasso and related methods have received rapidly increasing attention in statistics 
since the seminal work of Tibshirani (1996). For example, see the timely monograph by
Bühlmann and van de Geer (2011) as well as the retrospective review by Tibshirani (2011)
for a general overview and recent developments.

In this paper, we develop a method for estimating a high-dimensional regression model 
with a possible change-point due to a covariate threshold, while selecting relevant regressors 
from a set of many potential covariates. In particular, we propose the $\ell_1$-penalized least
squares (Lasso) estimator of parameters, including the unknown threshold parameter, and
analyze its properties under a sparsity assumption when the number of possible covariates 
can be much larger than the sample size. 

To be specific, let $\{(Y_i, X_i, Q_i) : i = 1, \dots, n\}$ be a sample of independent observations
such that

$$Y_i = X_i'\beta_0 + X_i'\delta_0 1\{Q_i < \tau_0\} + U_i, \quad i = 1, \dots, n, \tag{1.1}$$

where for each $i$, $X_i$ is an $M \times 1$ deterministic vector, $Q_i$ is a deterministic scalar, $U_i$ follows
$N(0, \sigma^2)$, and $1\{\cdot\}$ denotes the indicator function. The scalar variable $Q_i$ is the threshold
variable and $\tau_0$ is the unknown threshold parameter. Note that since $Q_i$ is a fixed variable
in our setup, (1.1) includes a regression model with a change-point at unknown time (e.g.
$Q_i = i/n$).
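
To make the data-generating process in (1.1) concrete, the following Python sketch draws a sample under a fixed design with $Q_i = i/n$. The function name, the toy design, and the coefficient values are illustrative assumptions and are not taken from the paper.

```python
import random

def simulate_threshold_model(X, Q, beta0, delta0, tau0, sigma, seed=0):
    """Draw Y_i = X_i'beta0 + X_i'delta0 * 1{Q_i < tau0} + U_i, with U_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    Y = []
    for x_i, q_i in zip(X, Q):
        base = sum(x * b for x, b in zip(x_i, beta0))
        # The extra term X_i'delta0 is switched on only below the threshold.
        shift = sum(x * d for x, d in zip(x_i, delta0)) if q_i < tau0 else 0.0
        Y.append(base + shift + rng.gauss(0.0, sigma))
    return Y

# Change-point in time: Q_i = i/n with a deterministic (hypothetical) design.
n = 8
X = [[1.0, (i % 3) - 1.0, i / n] for i in range(1, n + 1)]
Q = [i / n for i in range(1, n + 1)]
beta0 = [1.0, 0.0, 2.0]    # sparse baseline coefficients
delta0 = [0.5, 0.0, 0.0]   # only the intercept shifts when Q_i < tau0
Y = simulate_threshold_model(X, Q, beta0, delta0, tau0=0.5, sigma=0.0)
```

Setting `sigma=0.0` removes the noise, which makes it easy to check by hand that the shift applies exactly to the observations with $Q_i < \tau_0$.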

A regression model such as (1.1) offers applied researchers a simple yet useful framework 
to model nonlinear relationships by splitting the data into subsamples. Empirical examples 
include cross-country growth models with multiple equilibria (Durlauf and Johnson, 1995), 
racial segregation (Card et al., 2008), and financial contagion (Pesaran and Pick, 2007), 
among many others. Typically, the choice of the threshold variable is well motivated in 
applied work (e.g. initial per capita output in Durlauf and Johnson (1995), and the minority 
share in a neighborhood in Card et al. (2008)), but selection of other covariates is subject 
to applied researchers' discretion. However, covariate selection is important in identifying 
threshold effects (i.e., nonzero $\delta_0$) since a piece of evidence favoring threshold effects with
a particular set of covariates could be overturned by a linear model with a broader set of 
regressors. Therefore, it seems natural to consider Lasso as a tool to estimate (1.1). 

The statistical problem we consider in this paper is to estimate unknown parameters
$(\beta_0, \delta_0, \tau_0) \in \mathbb{R}^{2M+1}$ when $M$ is much larger than $n$. For the classical setup (estimation of
parameters without covariate selection when $M$ is smaller than $n$), estimation of (1.1) has
been well studied (see, e.g., Tong, 1990; Chan, 1993; Hansen, 2000). Also, a general method
for testing threshold effects in regression (i.e. testing $H_0 : \delta_0 = 0$ in (1.1)) is available for
the classical setup (see, e.g., Lee et al., 2011).

Although there are many papers on Lasso type methods and also equally many papers
on change points, sample splitting, and threshold models, there seem to be only a handful 
of papers that intersect both topics. Wu (2008) proposed an information-based criterion 
for carrying out change point analysis and variable selection simultaneously in linear
models with a possible change point; however, the proposed method in Wu (2008) would be
infeasible in a sparse high-dimensional model. In change-point models without covariates,
Harchaoui and Lévy-Leduc (2008, 2010) proposed a method for estimating the location of
change-points in one-dimensional piecewise constant signals observed in white noise, using
a penalized least-squares criterion with an $\ell_1$-type penalty, and Zhang and Siegmund (2007)
developed Bayes Information Criterion (BIC)-like criteria for determining the number of
changes in the mean of multiple sequences of independent normal observations when the
number of change-points can increase with the sample size. Ciuperca (2012) considered an
estimation problem similar to ours, but the corresponding analysis is restricted to the case
when the number of potential covariates is small.

In this paper, we consider the Lasso estimator of regression coefficients as well as the
threshold parameter. Theoretical properties of the Lasso and related methods for
high-dimensional data are examined by Bunea et al. (2007), Candès and Tao (2007), Bickel et al.
(2009), Meinshausen and Yu (2009), and van de Geer and Bühlmann (2009), among many
others. Most of the papers consider linear or nonparametric models with an additive mean
zero error. Some exceptions are van de Geer (2008), who considered high-dimensional
generalized linear models with Lipschitz loss functions; Belloni and Chernozhukov (2011a), who
developed the Lasso estimator of quantile regressions in high-dimensional sparse models;
and Bradic et al. (2012), who worked out nonconcave penalized methods, including Lasso,
for Cox's proportional hazards model with high-dimensional censored data. We contribute
to this literature by considering a regression model with a possible change-point and then
deriving nonasymptotic oracle inequalities for both the prediction risk and the $\ell_1$ estimation
loss for regression coefficients under a sparsity scenario. Since the Lasso estimator selects
variables simultaneously, we show that oracle inequalities can be established without pretesting
the existence of the threshold effect. Furthermore, we establish conditions under which the
unknown threshold parameter can be estimated at a rate of nearly $n^{-1}$ when the number of
regressors can be much larger than the sample size ($n$).

The remainder of this paper is organized as follows. In Section 2 we propose the Lasso estimator,
and in Section 3 we give a brief illustration of our proposed estimation method using a
real-data example in economics. In Section 4 we establish the prediction consistency of our
Lasso estimator. In Sections 5 - 8, we establish sparsity oracle inequalities in terms of both
the prediction loss and the $\ell_1$ estimation loss for $(\alpha_0, \tau_0)$, while providing low-level sufficient
conditions for three possible cases of threshold effects. In Section 9 we present results of
some simulation studies. Section 10 concludes and Appendix A contains all the proofs.

2. Lasso Estimation

Let $\mathbf{X}_i(\tau)$ denote the $(2M \times 1)$ vector such that $\mathbf{X}_i(\tau) = (X_i', X_i' 1\{Q_i < \tau\})'$ and let $\mathbf{X}(\tau)$
denote the $(n \times 2M)$ matrix whose $i$-th row is $\mathbf{X}_i(\tau)'$. This is distinguished from $X(\tau)$,
which denotes the $(n \times M)$ matrix whose $i$-th row is $X_i' 1\{Q_i < \tau\}$. Let $\alpha_0 = (\beta_0', \delta_0')'$. Then
(1.1) can be written as

$$Y_i = \mathbf{X}_i(\tau_0)'\alpha_0 + U_i, \quad i = 1, \dots, n. \tag{2.1}$$

Following Bickel et al. (2009), we use the following notation. For an $L$-dimensional
vector $a$, let $|a|_p$ denote the $\ell_p$ norm of $a$, and $|J(a)|$ denote the cardinality of $J(a)$, where
$J(a) = \{j \in \{1, \dots, L\} : a_j \neq 0\}$. In addition, let $\mathcal{M}(a)$ denote the number of nonzero
elements of $a$. Then,

$$\mathcal{M}(a) = \sum_{j=1}^{L} 1\{a_j \neq 0\} = |J(a)|.$$

The value $\mathcal{M}(\alpha_0)$ characterizes the sparsity of the model (2.1). Also, let $a_J$ denote the vector
in $\mathbb{R}^L$ that has the same coordinates as $a$ on $J$ and zero coordinates on the complement $J^c$
of $J$. For any $n$-dimensional vector $W = (W_1, \dots, W_n)'$, define the empirical norm as

$$\|W\|_n := \left(n^{-1} \sum_{i=1}^{n} W_i^2\right)^{1/2}.$$

Let $y = (Y_1, \dots, Y_n)'$. For any fixed $\tau$, consider the residual sum of squares

$$S_n(\alpha, \tau) = n^{-1} \sum_{i=1}^{n} \left(Y_i - X_i'\beta - X_i'\delta 1\{Q_i < \tau\}\right)^2 = \|y - \mathbf{X}(\tau)\alpha\|_n^2,$$

where $\alpha = (\beta', \delta')'$.

Indicating by the superscript $(j)$ the $j$-th element of a vector or the $j$-th column of a
matrix, define the following $(2M \times 2M)$ diagonal matrix:

$$D(\tau) := \mathrm{diag}\left\{\|\mathbf{X}^{(j)}(\tau)\|_n, \ j = 1, \dots, 2M\right\}.$$
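
As a minimal illustration of the objects just defined, the sketch below builds the stacked design $\mathbf{X}(\tau)$, the diagonal of $D(\tau)$, and the criterion $S_n(\alpha, \tau)$ in plain Python. The helper names and the four-observation design are hypothetical and only meant to show how the definitions fit together.

```python
import math

def stacked_design(X, Q, tau):
    """Rows X_i(tau) = (X_i', X_i' * 1{Q_i < tau})' of the n x 2M matrix X(tau)."""
    return [list(x) + [v if q < tau else 0.0 for v in x] for x, q in zip(X, Q)]

def empirical_norm(w):
    """||W||_n = (n^{-1} * sum_i W_i^2)^{1/2}."""
    return math.sqrt(sum(v * v for v in w) / len(w))

def column_weights(X_tau):
    """Diagonal of D(tau): the empirical norm of each column of X(tau)."""
    return [empirical_norm(col) for col in zip(*X_tau)]

def residual_sum_of_squares(Y, X_tau, alpha):
    """S_n(alpha, tau) = ||y - X(tau) alpha||_n^2."""
    resid = [y - sum(x * a for x, a in zip(row, alpha)) for y, row in zip(Y, X_tau)]
    return empirical_norm(resid) ** 2

# Hypothetical design with M = 2 regressors and n = 4 observations.
X = [[1.0, 2.0], [1.0, -2.0], [1.0, 0.0], [1.0, 4.0]]
Q = [0.25, 0.5, 0.75, 1.0]
X_tau = stacked_design(X, Q, tau=0.6)   # the first two rows fall below the threshold
D = column_weights(X_tau)
alpha = [1.0, 0.0, 2.0, 0.0]            # alpha = (beta', delta')'
Y = [sum(x * a for x, a in zip(row, alpha)) for row in X_tau]
S = residual_sum_of_squares(Y, X_tau, alpha)
```

Note that the last $M$ entries of `D` change with `tau`, which is why the penalty in the Lasso criterion is reweighted for each candidate threshold.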

For each fixed $\tau$, define the Lasso solution $\hat\alpha(\tau)$ by

$$\hat\alpha(\tau) := \operatorname{argmin}_{\alpha \in \mathbb{R}^{2M}} \left\{S_n(\alpha, \tau) + \lambda |D(\tau)\alpha|_1\right\}, \tag{2.2}$$

where $\lambda$ is a tuning parameter that depends on $n$. It is important to note that for each
fixed $\tau$, $\hat\alpha(\tau)$ is the weighted Lasso, which has advantages over the unweighted Lasso since
different values of $\tau$ generate different dictionaries.

We now estimate $\tau_0$ by

$$\hat\tau := \operatorname{argmin}_{\tau \in T \subset \mathbb{R}} \left\{S_n(\hat\alpha(\tau), \tau) + \lambda |D(\tau)\hat\alpha(\tau)|_1\right\},$$

where $T = [t_0, t_1]$ is a parameter space for $\tau_0$. In fact, for any finite $n$, $\hat\tau$ is given by an
interval and we simply define the maximum of the interval as our estimator. If we wrote the
model using $1\{Q_i > \tau\}$, then the convention would be the minimum of the interval being
the estimator. Then the estimator of $\alpha_0$ is defined as $\hat\alpha := \hat\alpha(\hat\tau)$. In fact, our proposed
estimator of $(\alpha, \tau)$ can be viewed as the one-step minimizer such that:

$$(\hat\alpha, \hat\tau) := \operatorname{argmin}_{\alpha \in \mathbb{R}^{2M},\, \tau \in T \subset \mathbb{R}} \left\{S_n(\alpha, \tau) + \lambda |D(\tau)\alpha|_1\right\}. \tag{2.3}$$
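
One way to compute the estimator in practice is to profile the criterion over $\tau$: solve the weighted Lasso for each candidate threshold and keep the candidate with the smallest penalized criterion. The sketch below does this in plain Python. It rests on assumptions that are not part of the text above: coordinate descent is used as the solver, and the candidates are the observed values of $Q_i$ inside $T$. The second choice is motivated by the fact that the criterion depends on $\tau$ only through the indicators $1\{Q_i < \tau\}$, so it is constant between consecutive observed values; keeping the largest minimizer then mirrors the convention of taking the maximum of the minimizing interval (up to the endpoints of $T$). All function names are hypothetical.

```python
import math

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def weighted_lasso(Y, Xt, lam, n_sweeps=200):
    """Coordinate descent for min_a n^{-1} ||Y - Xt a||_2^2 + lam * |D a|_1,
    where D is diagonal with the empirical norms of the columns of Xt."""
    n, p = len(Y), len(Xt[0])
    cols = [list(c) for c in zip(*Xt)]
    d = [math.sqrt(sum(v * v for v in c) / n) for c in cols]
    alpha, resid = [0.0] * p, list(Y)
    for _ in range(n_sweeps):
        for j in range(p):
            if d[j] == 0.0:      # empty regime: the column is identically zero
                continue
            # n^{-1} x_j'(partial residual), where the partial residual excludes column j.
            rho = sum(c * r for c, r in zip(cols[j], resid)) / n + d[j] ** 2 * alpha[j]
            new = soft_threshold(rho, lam * d[j] / 2.0) / d[j] ** 2
            if new != alpha[j]:
                resid = [r - c * (new - alpha[j]) for r, c in zip(resid, cols[j])]
                alpha[j] = new
    penalty = lam * sum(w * abs(a) for w, a in zip(d, alpha))
    return alpha, sum(r * r for r in resid) / n + penalty

def threshold_lasso(Y, X, Q, lam, tau_grid):
    """Profile the penalized criterion over tau_grid and keep the largest minimizer."""
    best = None
    for tau in sorted(tau_grid):
        Xt = [list(x) + [v if q < tau else 0.0 for v in x] for x, q in zip(X, Q)]
        alpha, obj = weighted_lasso(Y, Xt, lam)
        if best is None or obj <= best[2]:   # "<=" keeps the largest tau among ties
            best = (alpha, tau, obj)
    return best

# Noise-free toy data (hypothetical): the intercept shifts by 2 when Q_i < 0.5.
n = 12
Q = [i / n for i in range(1, n + 1)]
X = [[1.0, (-1.0) ** i] for i in range(1, n + 1)]
Y = [1.0 + (2.0 if q < 0.5 else 0.0) for q in Q]
grid = [q for q in Q if 0.25 <= q <= 0.75]   # T = [0.25, 0.75]
alpha_hat, tau_hat, objective = threshold_lasso(Y, X, Q, lam=0.05, tau_grid=grid)
```

In this toy example the irrelevant regressor is dropped from both regimes while the intercept shift is retained with some shrinkage, which is the variable-selection behaviour that motivates applying the Lasso to (1.1).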

3. Empirical Illustration 

In this section, we apply the proposed Lasso method to the growth regression models 
in economics. The neoclassical growth model predicts that economic growth rates would 
converge in the long run. This theory has been tested empirically by looking at the negative 
relationship between the long-run growth rate and the initial GDP given other covariates 
(see Barro and Sala-i-Martin (1995) and Durlauf et al. (2005) for literature reviews).
Although empirical results have confirmed this negative relationship, there has been some
criticism that the results heavily depend on the selection of covariates. Recently, Belloni
and Chernozhukov (2011b) show that the Lasso estimation can help select the covariates 
in the linear growth regression model and that the Lasso estimation results reconfirm the 
negative relationship between the long-run growth rate and the initial GDP. 

We consider the growth regression model with a possible threshold. Durlauf and Johnson 
(1995) provide the theoretical background of the existence of multiple steady states and 
estimate the model with two possible threshold variables. They check the robustness by
adding other available covariates in the model, but their approach is still not free from the
criticism of ad hoc variable selection. Our proposed Lasso method might be a good alternative in
this situation. Furthermore, as we will show later, our method works well even if there is 
no threshold effect in the model. Therefore, one might expect more robust results from our 
approach. 

The regression model we consider has the following form:

$$\mathit{gr}_i = \beta_0 + \beta_1 \mathit{lgdp60}_i + X_i'\beta_2 + 1\{Q_i < \tau\}\left(\delta_0 + \delta_1 \mathit{lgdp60}_i + X_i'\delta_2\right) + \varepsilon_i, \tag{3.1}$$

where $\mathit{gr}_i$ is the annualized GDP growth rate of country $i$ from 1960 to 1985, $\mathit{lgdp60}_i$ is the
log GDP in 1960, and $Q_i$ is a possible threshold variable for which we use the initial GDP
and the adult literacy rate in 1960 following Durlauf and Johnson (1995). Finally, $X_i$ is
a vector of additional covariates related to education, market efficiency, political stability,
market openness, demographic characteristics, and so on. The list of all covariates used
and the description of each variable are given in Table 1. We include as many covariates
as possible, which would mitigate the potential omitted variable bias. The data set mostly
comes from Barro and Lee (1994), and the additional adult literacy rate is from Durlauf
and Johnson (1995). Because of missing observations, we have 80 observations with 46
covariates (including a constant term) when $Q_i$ is the initial GDP, and 70 observations with
47 covariates when $Q_i$ is the literacy rate.

Tables 2 and 3 summarize the model selection and estimation results. To compare
different model specifications, we also apply the Lasso procedure to a linear model, i.e. all $\delta$'s
are zero in Equation (3.1). In each case, the regularization parameter $\lambda$ is chosen by the
'leave-one-out' least squares cross validation method.

The main empirical findings are as follows. First, note that the number of covariates in
the threshold models is bigger than the number of observations. Thus, we cannot adopt
the standard least squares method to estimate the threshold regression model. Second,
the coefficients of lgdp60 are negative in all models, which confirms the theory of the
neoclassical growth models. Third, the coefficients of interaction terms between lgdp60 and
various education variables show the existence of threshold effects in both threshold model
specifications. This result implies that the growth convergence rates can vary according
to different education levels. Specifically, note that the interaction term between lgdp and
'educ' implies that the marginal effect of lgdp becomes

$$\mathit{lgdp} \times \left(\beta_1 + \beta_2\, \mathit{educ} + 1\{Q < \tau\}\left(\delta_1 + \delta_2\, \mathit{educ}\right)\right).$$

In both threshold models, we have $\hat\delta_1 = 0$, but some $\hat\delta_2$'s are not zero. Thus, conditional
on other covariates, there exist different technological diffusion effects according to the
threshold point. In other words, a country with high education levels will converge faster
by absorbing technology easily and quickly. Finally, the Lasso with the threshold model
specification selects a more parsimonious model than that with the linear specification even
though the former includes more candidate covariates.

Compared to the results of Durlauf and Johnson (1995), our estimation results differ in a
couple of respects. The Lasso estimator does not confirm the threshold effect for
the variable lgdp60 itself. Different convergence rates arise only through the interaction
with the education variables. It is also noteworthy that the threshold parameter estimates
are much higher than those chosen by Durlauf and Johnson (1995). These differences show
the importance of model selection and the advantage of the proposed Lasso estimation.



4. Prediction Consistency 

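In this section, we establish the prediction consistency of our Lasso estimator. For
notational simplicity, we adopt the following conventions: $\hat D = D(\hat\tau)$ and $D = D(\tau_0)$,
and similarly, $\hat S_n = S_n(\hat\alpha, \hat\tau)$ and $S_n = S_n(\alpha_0, \tau_0)$, and so on.
Define $f_{(\alpha,\tau)}(x, q) := x'\beta + x'\delta 1\{q < \tau\}$, $f_0(x, q) := x'\beta_0 + x'\delta_0 1\{q < \tau_0\}$, and
$\hat f(x, q) := x'\hat\beta + x'\hat\delta 1\{q < \hat\tau\}$. Let

$$V_{1j} := \left(n\sigma \|X^{(j)}\|_n\right)^{-1} \sum_{i=1}^{n} U_i X_i^{(j)}, \qquad V_{2j}(\tau) := \left(n\sigma \|X^{(j)}(\tau)\|_n\right)^{-1} \sum_{i=1}^{n} U_i X_i^{(j)} 1\{Q_i < \tau\}.$$

For a positive constant $\mu < 1$, define the events

$$\mathbb{A} := \bigcap_{j=1}^{M} \left\{2|V_{1j}| \le \mu\lambda/\sigma\right\}, \qquad \mathbb{B} := \bigcap_{j=1}^{M} \left\{2 \sup_{\tau \in T} |V_{2j}(\tau)| \le \mu\lambda/\sigma\right\}.$$

Also define $J_0 := J(\alpha_0)$ and $R_n := R_n(\alpha_0, \tau_0)$, where

$$R_n(\alpha, \tau) := 2n^{-1} \sum_{i=1}^{n} U_i X_i'\delta \left\{1(Q_i < \hat\tau) - 1(Q_i < \tau)\right\}.$$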

The following lemma gives some useful basic inequalities that provide a basis for all our 
theoretical results. 



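Lemma 1 (Basic Inequalities). Conditional on the events $\mathbb{A}$ and $\mathbb{B}$, we have

$$\|\hat f - f_0\|_n^2 + (1-\mu)\lambda \left|\hat D(\hat\alpha - \alpha_0)\right|_1 \le 2\lambda \left|\hat D(\hat\alpha - \alpha_0)_{J_0}\right|_1 + \lambda \left| |\hat D\alpha_0|_1 - |D\alpha_0|_1 \right| + R_n \tag{4.1}$$

and

$$\|\hat f - f_0\|_n^2 + (1-\mu)\lambda \left|\hat D(\hat\alpha - \alpha_0)\right|_1 \le 2\lambda \left|\hat D(\hat\alpha - \alpha_0)_{J_0}\right|_1 + \|f_{(\alpha_0,\hat\tau)} - f_0\|_n^2. \tag{4.2}$$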

The basic inequalities in Lemma 1 involve more terms than those for the linear model (e.g.
Lemma 6.1 of Bühlmann and van de Geer, 2011) because our model in (1.1) includes the
unknown threshold parameter $\tau_0$ and the weighted Lasso is considered in (2.2). Also, having
different upper bounds in (4.1) and (4.2) for the same lower bound helps prove our main
results.

We now establish conditions under which $\mathbb{A} \cap \mathbb{B}$ has probability close to one with a suitable
choice of $\lambda$. Define

$$r_n := \min_{1 \le j \le M} \frac{\|X^{(j)}(t_0)\|_n^2}{\|X^{(j)}\|_n^2},$$

where $X^{(j)}(\tau) = \left(X_1^{(j)} 1\{Q_1 < \tau\}, \dots, X_n^{(j)} 1\{Q_n < \tau\}\right)'$ as before. Let $\Phi$ denote the
cumulative distribution function of the standard normal.

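Lemma 2 (Probability of $\mathbb{A} \cap \mathbb{B}$). Let $\{U_i : i = 1, \dots, n\}$ be independent and identically
distributed as $N(0, \sigma^2)$. Then

$$P\{\mathbb{A} \cap \mathbb{B}\} \ge 1 - 6M\,\Phi\left(-\frac{\mu\sqrt{n r_n}}{2\sigma}\lambda\right).$$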

We are ready to establish the prediction consistency of the Lasso. Define $X_{\max} :=
\max(D)$ and $X_{\min} := \min(D(t_0))$. Also, let $\alpha_{\max}$ denote the maximum value that all the
elements of $\alpha$ can take in absolute value.

Theorem 3 (Consistency of the Lasso). Let $(\hat\alpha, \hat\tau)$ be the Lasso estimator defined by (2.3)
with

$$\lambda = A\sigma \left(\frac{\log 3M}{n r_n}\right)^{1/2} \tag{4.3}$$

for some constant $A > 2\sqrt{2}/\mu$. Then, with probability at least $1 - (3M)^{1 - A^2\mu^2/8}$, we have

$$\|\hat f - f_0\|_n^2 \le 6\lambda X_{\max}\alpha_{\max}\mathcal{M}(\alpha_0) + 2\mu\lambda X_{\max}|\delta_0|_1 \le 8 X_{\max}\alpha_{\max}\lambda\mathcal{M}(\alpha_0).$$
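
The probability statement in Theorem 3 can be traced back to Lemma 2: plugging $\lambda$ from (4.3) into the argument of $\Phi$ gives $x = \mu A (\log 3M)^{1/2}/2$, and the standard Gaussian tail bound $\Phi(-x) \le \tfrac{1}{2}\exp(-x^2/2)$ for $x \ge 0$ yields $6M\Phi(-x) \le (3M)^{1 - A^2\mu^2/8}$. The Python sketch below checks this numerically and also computes $r_n$ for a toy fixed design. The function names, the design, and the constants are illustrative assumptions.

```python
import math
from statistics import NormalDist

def r_n(X, Q, t0):
    """r_n = min_j ||X^(j)(t0)||_n^2 / ||X^(j)||_n^2 (the factor 1/n cancels in the ratio)."""
    ratios = []
    for col in zip(*X):
        full = sum(v * v for v in col)
        below = sum(v * v for v, q in zip(col, Q) if q < t0)
        ratios.append(below / full)
    return min(ratios)

def tuning_parameter(A, sigma, M, n, rn):
    """lambda = A * sigma * (log(3M) / (n * r_n))^{1/2}, as in (4.3)."""
    return A * sigma * math.sqrt(math.log(3 * M) / (n * rn))

def lemma2_lower_bound(M, n, rn, lam, sigma, mu):
    """1 - 6 M Phi(-mu * sqrt(n r_n) * lambda / (2 sigma))."""
    x = mu * math.sqrt(n * rn) * lam / (2.0 * sigma)
    return 1.0 - 6.0 * M * NormalDist().cdf(-x)

def theorem3_probability(M, A, mu):
    """1 - (3M)^(1 - A^2 mu^2 / 8)."""
    return 1.0 - (3.0 * M) ** (1.0 - A * A * mu * mu / 8.0)

# Hypothetical fixed design with M = 2 columns, Q_i = i/n and t0 = 0.2.
n, M = 200, 2
Q = [i / n for i in range(1, n + 1)]
X = [[1.0, q] for q in Q]
rn = r_n(X, Q, t0=0.2)

mu, sigma, A = 0.5, 1.0, 6.0          # A exceeds 2 * sqrt(2) / mu
lam = tuning_parameter(A, sigma, M, n, rn)
exact_bound = lemma2_lower_bound(M, n, rn, lam, sigma, mu)
simplified_bound = theorem3_probability(M, A, mu)
```

The simplified bound is looser than the one in Lemma 2 but has the advantage of not depending on $n$, $r_n$ or $\sigma$ once $\lambda$ is chosen as in (4.3).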

The nonasymptotic upper bound on the prediction loss in Theorem 3 can be easily
translated into asymptotic convergence. Specifically, if $X_{\max}$ and $\alpha_{\max}$ are bounded, then
Theorem 3 gives

$$\|\hat f - f_0\|_n^2 \lesssim \lambda\mathcal{M}(\alpha_0).$$

Hence, Theorem 3 implies the consistency of the Lasso, provided that $n \to \infty$, $M \to \infty$,
and $\lambda\mathcal{M}(\alpha_0) \to 0$. Given the choice of $\lambda$ in (4.3), the last condition requires that the
sparsity of the model be of smaller order than $\{(n r_n)/\log 3M\}^{1/2}$.

Remark 1. Regarding consistency of the Lasso, see, among others, Corollary 6.1 of Bühlmann
and van de Geer (2011) for high-dimensional linear models and Lemma 6.7 of Bühlmann
and van de Geer (2011) for general convex loss functions. If $r_n$ is bounded away from zero,
then our result in Theorem 3 coincides with those of Bühlmann and van de Geer (2011).

5. Oracle Inequalities 

In this section, we establish sparsity oracle inequalities in terms of both the prediction
loss and the $\ell_1$ estimation loss for $\alpha_0$. First of all, we make the following assumption that was
first introduced by Bickel et al. (2009).

Assumption 1 (Restricted Eigenvalue (RE) $(s, c_0)$). For some integer $s$ such that $1 \le s \le
2M$ and a positive number $c_0$, the following condition holds:

$$\kappa(s, c_0) := \min_{J_0 \subseteq \{1, \dots, 2M\},\ |J_0| \le s} \ \min_{\gamma \neq 0,\ |\gamma_{J_0^c}|_1 \le c_0 |\gamma_{J_0}|_1} \frac{|\mathbf{X}(\tau_0)\gamma|_2}{\sqrt{n}\,|\gamma_{J_0}|_2} > 0.$$

Assumption 1 is just a restatement of the restricted eigenvalue assumption of Bickel et al.
(2009) for the case where $\tau_0$ is known. Bickel et al. (2009) provide sufficient conditions for the
restricted eigenvalue condition. In addition, van de Geer and Bühlmann (2009) show the
relations between the restricted eigenvalue condition and other conditions on the design
matrix.

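Assumption 2 (Oracle Condition A). For some non-negative constant $L_1$, either one of
the following two conditions holds:

$$\|f_{(\alpha_0,\hat\tau)} - f_0\|_n^2 \le L_1 \lambda \left|\hat D(\hat\alpha - \alpha_0)_{J_0}\right|_1, \tag{5.1}$$

$$\lambda \left| |\hat D\alpha_0|_1 - |D\alpha_0|_1 \right| + R_n \le L_1 \lambda \left|\hat D(\hat\alpha - \alpha_0)_{J_0}\right|_1. \tag{5.2}$$

Assumption 3 (Oracle Condition B). For some positive constant $L_2$, the following
condition holds:

$$\|f_{(\hat\alpha,\tau_0)} - f_0\|_n^2 \le L_2 \|\hat f - f_0\|_n^2. \tag{5.3}$$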

Assumptions 2 and 3 are useful to obtain an oracle inequality, in conjunction with
Assumption 1. These tighten the bounds in Lemma 1.

Conditions (5.1) and (5.2) in Assumption 2 are rather high-level assumptions but useful
to derive sparsity oracle inequalities. In the following sections, where we present the main
theorems of the paper, we verify Assumption 2 or dispense with it under more primitive conditions.
Intuitively, Assumption 2 is satisfied if $\delta_0$ is very small, or if the difference between $\hat\tau$ and
$\tau_0$ is sufficiently small relative to the difference between $\hat\alpha$ and $\alpha_0$.

Remark 2. It is worth noting that (5.1) holds when $\delta_0 = 0$, that is, when there is no
threshold effect. Therefore, the oracle inequalities below hold regardless of the existence
of a threshold effect, implying that we can make predictions without knowing whether a
threshold effect is present and without pretesting for it.

Remark 3. The smallest $L_2$ in Assumption 3 can be chosen as

$$\frac{\|f_{(\hat\alpha,\tau_0)} - f_0\|_n^2}{\|\hat f - f_0\|_n^2},$$

provided that the denominator is nonzero. It seems natural, as an important special case, to
assume that the prediction loss is at least as large as an infeasible prediction risk replacing
$\hat\tau$ with $\tau_0$, which would imply that $L_2 = 1$. When we have $L_2 > 1$, this results in an increase
in the bounds of the oracle inequalities below relative to the case when $\tau_0$ is known.

The following lemma is useful to derive sparsity oracle inequalities of the Lasso. 

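Lemma 4 (Oracle Inequalities of the Lasso). Assume that Assumption 1 holds with $\kappa =
\kappa\left(s, \frac{1+\mu+L_1}{1-\mu}\right)$ for $\mu < 1$ and $\mathcal{M}(\alpha_0) \le s \le M$. Furthermore, let Assumptions 2 and 3 hold.
Then conditional on the event $\mathbb{A} \cap \mathbb{B}$, we have

$$\|\hat f - f_0\|_n \le \frac{(2 + L_1)\, X_{\max} \sqrt{L_2}}{\kappa}\, \lambda\sqrt{s}$$

and

$$|\hat\alpha - \alpha_0|_1 \le \frac{(2 + L_1)^2\, X_{\max}^2 L_2}{(1-\mu)\,\kappa^2 X_{\min}}\, \lambda s.$$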

Remark 4. Compared to the case when $\tau_0$ is known (so that we can take $T = \{\tau_0\}$ and
thus $L_1 = 0$ and $L_2 = 1$), the upper bound is bigger by the multiple of $\sqrt{L_2}\,(2 + L_1)/2$ for
$\|\hat f - f_0\|_n$ and of $L_2 (2 + L_1)^2/4$ for $|\hat\alpha - \alpha_0|_1$. These multipliers can be viewed as prices
to pay to estimate the unknown threshold parameter $\tau_0$.

We now provide a lemma to derive an oracle inequality regarding the sparsity of the
Lasso estimator $\hat\alpha$. To do so, we make the following assumption.

Assumption 4. Assume that the largest eigenvalue of $\mathbf{X}(\tau)'\mathbf{X}(\tau)/n$ is bounded uniformly
in $\tau \in T$ by $\phi_{\max}$.

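Lemma 5 (Sparsity of the Lasso). Let Assumption 4 hold. Then conditional on the event
$\mathbb{A} \cap \mathbb{B}$, we have

$$\mathcal{M}(\hat\alpha) \le \frac{4\phi_{\max}}{(1-\mu)^2 \lambda^2 X_{\min}^2}\, \|\hat f - f_0\|_n^2. \tag{5.4}$$

Lemma 5, combined with Lemma 4, implies that $\mathcal{M}(\hat\alpha)$ is at most a constant multiple of
$\mathcal{M}(\alpha_0)$, where the constant depends on $L_1$, $L_2$, $X_{\max}$, $X_{\min}$, $\phi_{\max}$, $\mu$ and $\kappa$ (independent of
$\lambda$).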



6. Case I. No Threshold 

We first consider the case that $\delta_0 = 0$. In other words, we estimate a threshold model via
the Lasso, but the true model is simply a linear model $Y_i = X_i'\beta_0 + U_i$. This is an important
case to consider since in applications, one may be unsure not only about covariate selection
but also about the existence of the threshold in the model.

When $\delta_0 = 0$, Assumption 2 (in particular, equation (5.1)) holds automatically with
$L_1 = 0$. Then the following theorem can be proved easily, thanks to Lemmas 4 and 5.



THE LASSO FOR HIGH-DIMENSIONAL REGRESSION WITH A POSSIBLE CHANGE-POINT 11 



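Theorem 6. Assume that $\delta_0 = 0$ and Assumption 1 with $(s, c_0) = \left(s, \frac{1+\mu}{1-\mu}\right)$ for some
$\mathcal{M}(\alpha_0) \le s \le M$ and Assumption 3 hold. Let $U_i$ follow $N(0, \sigma^2)$ and $(\hat\alpha, \hat\tau)$ be the Lasso
estimator defined by (2.3) with

$$\lambda = A\sigma \left(\frac{\log 3M}{n r_n}\right)^{1/2}$$

and $A > 2\sqrt{2}/\mu$. Then, with probability at least $1 - (3M)^{1 - A^2\mu^2/8}$, we have

$$\|\hat f - f_0\|_n \le \frac{2A\sigma X_{\max}}{\kappa} \left(\frac{L_2 \log 3M}{n r_n}\, s\right)^{1/2},$$

$$|\hat\alpha - \alpha_0|_1 \le \frac{4A\sigma L_2 X_{\max}^2}{(1-\mu)\,\kappa^2 X_{\min}} \left(\frac{\log 3M}{n r_n}\right)^{1/2} s.$$

If Assumption 4 holds in addition, then

$$\mathcal{M}(\hat\alpha) \le \frac{16\,\phi_{\max} L_2 X_{\max}^2}{(1-\mu)^2 \kappa^2 X_{\min}^2}\, s.$$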



To appreciate the usefulness of the inequalities derived above, it is worth comparing
the inequalities in Theorem 6 with those in Theorem 7.2 of Bickel et al. (2009). The latter
corresponds to the case where $\delta_0 = 0$ is known a priori, $\lambda = 2A\sigma(\log M/n)^{1/2}$, $\mu = 1/2$,
and $X_{\max} = 1$ using our notation. It can be seen that if we take the same $\lambda$ both in
Theorem 6 and in Theorem 7.2 of Bickel et al. (2009), the bounds for the prediction risk,
the $\ell_1$ estimation loss for $\alpha_0$, and the sparsity of $\hat\alpha$ are larger only by the multiples of
$\sqrt{L_2} X_{\max}$, $L_2 X_{\max}^2/X_{\min}$ and $L_2 X_{\max}^2/X_{\min}^2$, respectively. As mentioned in Remark 4, these
multipliers can be viewed as prices to pay to estimate $(\alpha_0, \tau_0)$ without knowing that $\delta_0 = 0$.
Perhaps more importantly, when these multipliers are bounded uniformly in $n$, the main
implication of Theorem 6 is that our Lasso estimator in (2.3) gives qualitatively the
same oracle inequalities as the Lasso estimator in the linear model, even though our model
is much more overparametrized in that $\delta$ and $\tau$ are added to $\beta$ as parameters to estimate.



7. Case II. Diminishing Threshold 

We now consider the case when there is a nonzero $\delta_0$, but the threshold parameter $\tau_0$ is not
well identified. We formulate this case by assuming that $\max_{j=1,\dots,M} |\delta_{0j}| = d_0 n^{-\nu}$
for some positive constants $\nu$ and $d_0$, and call this case the diminishing threshold. To
establish oracle inequalities, we need to make the following additional assumption.

Assumption 5 (Smoothness of Design). For any $\eta > 0$, there exists $C < \infty$ such that

$$\sup_{j} \sup_{|\tau - \tau_0| < \eta} \frac{1}{n} \sum_{i=1}^{n} \left|X_i^{(j)}\right|^2 \left|1(Q_i < \tau_0) - 1(Q_i < \tau)\right| \le C\eta.$$

Assumption 5 has been assumed in the classical setup with a fixed number of stochastic
regressors to exclude cases where $Q_i$ has a point mass at $\tau_0$ or $E(X_i | Q_i = \tau_0)$ is unbounded. In
our setup, Assumption 5 amounts to a deterministic version of some smoothness assumption
with respect to the threshold variable $Q_i$.
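
For a given fixed design, the left-hand side of the inequality in Assumption 5 can be evaluated directly. The Python sketch below computes it for one value of $\eta$ and returns its ratio to $\eta$, that is, the smallest constant $C$ that works for that $\eta$. The function name and the toy design are hypothetical. The code uses the fact that the inner sum grows as $\tau$ moves away from $\tau_0$, so the supremum over $|\tau - \tau_0| < \eta$ is obtained by collecting all observations with $\tau_0 - \eta < Q_i < \tau_0$ on the left and $\tau_0 \le Q_i < \tau_0 + \eta$ on the right.

```python
def smoothness_ratio(X, Q, tau0, eta):
    """sup_j sup_{|tau - tau0| < eta} n^{-1} sum_i |X_i^(j)|^2 |1(Q_i < tau0) - 1(Q_i < tau)|,
    divided by eta."""
    n = len(Q)
    worst = 0.0
    for col in zip(*X):
        # tau < tau0: the indicators differ for tau <= Q_i < tau0.
        left = sum(v * v for v, q in zip(col, Q) if tau0 - eta < q < tau0)
        # tau > tau0: the indicators differ for tau0 <= Q_i < tau.
        right = sum(v * v for v, q in zip(col, Q) if tau0 <= q < tau0 + eta)
        worst = max(worst, left / n, right / n)
    return worst / eta

# Hypothetical fixed design: Q_i = i/n, an intercept and one trending regressor.
n = 100
Q = [i / n for i in range(1, n + 1)]
X = [[1.0, 2.0 * q] for q in Q]
ratio = smoothness_ratio(X, Q, tau0=0.5, eta=0.105)
```

Repeating the computation over the range of $\eta$ that is relevant for the application gives a rough sense of how large $C$ has to be for a particular design.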

Let $L_1$ be the non-negative solution of the following equation (in terms of $x$):

$$f(x) = x(2 + x) = C d_0^2 n^{-2\nu} \mathcal{M}(\delta_0)\, |t_1 - t_0|\, \kappa^2 \left(X_{\max}^2 L_2 \lambda^2 \mathcal{M}(\alpha_0)\right)^{-1} =: C^*_{L_1}.$$

That is, $L_1 = -1 + \sqrt{1 + C^*_{L_1}}$. We now give oracle inequalities for the diminishing threshold
case.
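
The constant $L_1$ is the non-negative root of the quadratic $x^2 + 2x - C^* = 0$, where $C^*$ stands for the right-hand side of the defining equation. A minimal Python sketch of this computation follows; the function name is hypothetical and $C^*$ is passed in as a plain number.

```python
import math

def diminishing_threshold_L1(c_star):
    """Non-negative root of x * (2 + x) = c_star, namely x = -1 + sqrt(1 + c_star)."""
    if c_star < 0:
        raise ValueError("c_star must be non-negative")
    return -1.0 + math.sqrt(1.0 + c_star)

L1_example = diminishing_threshold_L1(3.0)
# Plugging the root back into x * (2 + x) recovers c_star.
recovered = L1_example * (2.0 + L1_example)
```

Since the root is increasing in $C^*$ and equals zero at $C^* = 0$, a small $C^*$ translates directly into a small $L_1$.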



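Theorem 7. Assume that

$$\max_{j=1,\dots,M} |\delta_{0j}| = d_0 n^{-\nu}, \tag{7.1}$$

for some positive constants $\nu$ and $d_0$. Also, let Assumption 1 with $(s, c_0) = \left(s, \frac{1+\mu+L_1}{1-\mu}\right)$ for
some $\mathcal{M}(\alpha_0) \le s \le M$, Assumption 3, and Assumption 5 hold. Let $U_i$ follow $N(0, \sigma^2)$ and
$(\hat\alpha, \hat\tau)$ be the Lasso estimator defined by (2.3) with

$$\lambda = A\sigma \left(\frac{\log 3M}{n r_n}\right)^{1/2}$$

and $A > 2\sqrt{2}/\mu$. Then, with probability at least $1 - (3M)^{1 - A^2\mu^2/8}$, we have

$$\|\hat f - f_0\|_n \le \frac{A\sigma (2 + L_1) X_{\max}}{\kappa} \left(\frac{L_2 \log 3M}{n r_n}\, s\right)^{1/2},$$

$$|\hat\alpha - \alpha_0|_1 \le \frac{A\sigma (2 + L_1)^2 L_2 X_{\max}^2}{(1-\mu)\,\kappa^2 X_{\min}} \left(\frac{\log 3M}{n r_n}\right)^{1/2} s.$$

If Assumption 4 holds in addition, then

$$\mathcal{M}(\hat\alpha) \le \frac{4\phi_{\max} (2 + L_1)^2 L_2 X_{\max}^2}{(1-\mu)^2 \kappa^2 X_{\min}^2}\, s.$$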

Theorem 7 gives inequalities that are qualitatively equivalent to those in Theorem 6. Note that
the $\kappa$'s in Theorems 6 and 7 can be different from each other, since different values of $c_0$ are
assumed in the RE condition.

Remark 5. The diminishing threshold case can be viewed as a local departure from the
no-threshold case $\delta_0 = 0$. In this view, it is interesting to know the situation when the positive
constant $L_1$ is close to zero. Note that $L_1$ approaches zero if and only if $C^*_{L_1}$ gets close
to zero. Suppose that $C$, $d_0$, $t_0$, $t_1$, $\kappa$, $X_{\max}$, $L_2$, $A$, and $\sigma$ are independent of $n$. Then $L_1$
converges to zero if and only if

$$n^{-2\nu} \mathcal{M}(\delta_0) \mathcal{M}(\alpha_0)^{-1} (n r_n)/(\log 3M) \to 0.$$

Hence, if $r_n$ is bounded away from zero and $M$ diverges to infinity, then $\nu$ can be larger
than or equal to 1/2, which gives the lower limit for the diminishing $\delta_0$ to be interpreted as
the local departure from the no-threshold case.



8. Case III. Fixed Threshold 

This section explores the case where the threshold effect is well identified and
discontinuous. We begin with the following assumption to reflect this.

Assumption 6 (Identifiability under Sparsity and Discontinuity of Regression). For a
given $s \ge \mathcal{M}(\alpha_0)$, and for any $\eta$ and $\tau$ such that $|\tau - \tau_0| > \eta \ge \min_{i \neq j} |Q_i - Q_j|$ and
$\alpha \in \{\alpha : \mathcal{M}(\alpha) \le s\}$, there exists a $c > 0$ such that

$$\|f_{(\alpha,\tau)} - f_{(\alpha_0,\tau_0)}\|_n^2 > c\eta > 0.$$

Assumption 6 implies, among other things, that for some $s \ge \mathcal{M}(\alpha_0)$, and for any
$\alpha \in \{\alpha : \mathcal{M}(\alpha) \le s\}$ and $\tau$ such that $(\alpha, \tau) \neq (\alpha_0, \tau_0)$,

$$\|f_{(\alpha,\tau)} - f_{(\alpha_0,\tau_0)}\|_n \neq 0.$$

This condition can be regarded as identifiability of $\tau_0$. If $\tau_0$ were known, then a sufficient
condition for the identifiability under the sparsity would be that RE $(s, c_0)$ holds for some
$c_0 \ge 1$. Note that the RE condition is not required for other values of $\tau$ than $\tau_0$ in our paper.
Thus, the main point in (8.1) is that there is no sparse representation that is equivalent to
$f_0$ when the sample is split by $\tau \neq \tau_0$. For the fixed threshold case, basically we replace
Assumption 2 with Assumption 6, believing that the latter condition is easier to interpret
than the former.



Remark 6. Assumption 6 is stronger than just the identifiability of $\tau_0$ as it specifies the rate
of deviation in $f$ as $\tau$ moves away from $\tau_0$. The linear rate here is sharper than the quadratic
one that is usually observed in more regular M-estimation problems and it reflects the fact
that the limit criterion function, in the classical setup with a fixed number of stochastic
regressors, has a kink at the true $\tau_0$. For instance, suppose that $\{(Y_i, X_i, Q_i) : i = 1, \dots, n\}$
are independent and identically distributed, and consider the case where only the intercept
is included in $X_i$. Assuming that $Q_i$ has a density function that is continuous and positive
everywhere (so that $P(\tau \le Q_i < \tau_0)$ and $P(\tau_0 \le Q_i < \tau)$ can be bounded below by $c_1 |\tau - \tau_0|$
for some $c_1 > 0$), we have that

$$\begin{aligned}
E\left(Y_i - f_i(\alpha, \tau)\right)^2 - E\left(Y_i - f_i(\alpha_0, \tau_0)\right)^2
&= E\left(f_i(\alpha_0, \tau_0) - f_i(\alpha, \tau)\right)^2 \\
&= (\alpha_1 - \alpha_{10})^2 P(Q_i < \tau \wedge \tau_0) + (\alpha_2 - \alpha_{20})^2 P(Q_i \ge \tau \vee \tau_0) \\
&\quad + (\alpha_2 - \alpha_{10})^2 P(\tau \le Q_i < \tau_0) + (\alpha_1 - \alpha_{20})^2 P(\tau_0 \le Q_i < \tau) \\
&\ge c\,|\tau - \tau_0|,
\end{aligned}$$

for some $c > 0$, where $f_i(\alpha, \tau) = X_i'\beta + X_i'\delta 1\{Q_i < \tau\}$, $\alpha_1 = \beta + \delta$ and $\alpha_2 = \beta$, unless
$|\alpha_2 - \alpha_{10}|$ is too small when $\tau < \tau_0$ and $|\alpha_1 - \alpha_{20}|$ is too small when $\tau > \tau_0$. However,
when $|\alpha_2 - \alpha_{10}|$ is small, say smaller than $\varepsilon$, $|\alpha_2 - \alpha_{20}|$ is bounded away from zero due to the
discontinuity that $\alpha_{10} \neq \alpha_{20}$, and $P(Q_i \ge \tau \vee \tau_0) = P(Q_i \ge \tau_0)$ is also bounded away from zero.
This implies the inequality still holds. Since the same reasoning applies for the latter case,
we can conclude our discontinuity assumption holds in the standard discontinuous threshold
regression setup. In other words, the previous literature has typically imposed conditions
sufficient to ensure this condition.
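
The linear rate discussed in Remark 6 can be checked numerically in the intercept-only example. The Python sketch below evaluates the four-term decomposition of $E(f_i(\alpha_0, \tau_0) - f_i(\alpha, \tau))^2$ under the additional assumption, made only for this illustration, that $Q_i$ is uniformly distributed on $(0, 1)$, so that each probability is the length of an interval. The function name and parameter values are hypothetical.

```python
def excess_risk(a1, a2, tau, a10, a20, tau0):
    """E(f_i(alpha0, tau0) - f_i(alpha, tau))^2 for the intercept-only model,
    assuming Q_i ~ Uniform(0, 1) and 0 < tau, tau0 < 1."""
    lo, hi = min(tau, tau0), max(tau, tau0)
    return ((a1 - a10) ** 2 * lo                     # Q_i < min(tau, tau0)
            + (a2 - a20) ** 2 * (1.0 - hi)           # Q_i >= max(tau, tau0)
            + (a2 - a10) ** 2 * max(tau0 - tau, 0.0)  # tau <= Q_i < tau0
            + (a1 - a20) ** 2 * max(tau - tau0, 0.0))  # tau0 <= Q_i < tau

# True regime levels alpha_10 = 3 and alpha_20 = 1 with tau0 = 0.5.
# Holding alpha at its true value isolates the effect of moving tau.
near = excess_risk(3.0, 1.0, 0.4, 3.0, 1.0, 0.5)
far = excess_risk(3.0, 1.0, 0.3, 3.0, 1.0, 0.5)
```

With $\alpha$ fixed at $\alpha_0$, the excess risk reduces to $(\alpha_{20} - \alpha_{10})^2 |\tau - \tau_0|$, so doubling the distance between $\tau$ and $\tau_0$ doubles the risk. This is the kink at $\tau_0$ mentioned in the remark.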

Remark 7. The restriction $\eta \ge \min_{i \neq j} |Q_i - Q_j|$ in Assumption 6 is necessary since we
consider the fixed design for both $X_i$ and $Q_i$. Throughout this section, we implicitly assume
that the sample size $n$ is large enough such that $\min_{i \neq j} |Q_i - Q_j|$ never binds in any of the
inequalities below. This is typically true for the random design case if $Q_i$ is continuously
distributed.

To simplify notation, in this section, we assume without loss of generality that $Q_i = i/n$.
Then $T = [t_0, t_1] \subset [0, 1]$. For some constant $\eta > 0$, define an event

$$\mathbb{C}(\eta) := \left\{ \sup_{|\tau - \tau_0| < \eta} \left| \frac{2}{n} \sum_{i=1}^{n} U_i X_i'\delta_0 \left[1(Q_i < \tau_0) - 1(Q_i < \tau)\right] \right| \le \lambda\sqrt{\eta} \right\}.$$

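The following lemma gives the lower bound of the probability of the event
$\mathbb{A} \cap \mathbb{B} \cap [\cap_{j=1}^{m} \mathbb{C}(\eta_j)]$ for a given $m$ and some positive constants $\eta_1, \dots, \eta_m$. To deal with the event
$\cap_{j=1}^{m} \mathbb{C}(\eta_j)$, an extra term is added to the lower bound of the probability, in comparison to
Lemma 2.

Lemma 8 (Probability of $\mathbb{A} \cap \mathbb{B} \cap \{\cap_{j=1}^{m} \mathbb{C}(\eta_j)\}$). For a given $m$ and some positive constants
$\eta_1, \dots, \eta_m$,

$$P\left\{ \mathbb{A} \cap \mathbb{B} \cap \left[ \bigcap_{j=1}^{m} \mathbb{C}(\eta_j) \right] \right\} \ge 1 - 6M\,\Phi\left(-\frac{\mu\sqrt{n r_n}}{2\sigma}\lambda\right) - 4 \sum_{j=1}^{m} \Phi\left(-\frac{\lambda\sqrt{n}}{2\sqrt{2}\,\sigma h_n(\eta_j)}\right),$$

where $h_n(\eta) = \left((2n\eta)^{-1} \sum_{i=[n(\tau_0 - \eta)]}^{[n(\tau_0 + \eta)]} (X_i'\delta_0)^2\right)^{1/2}$.

The following lemma gives an upper bound of $|\hat\tau - \tau_0|$ using only Assumption 6,
conditional on the events $\mathbb{A}$ and $\mathbb{B}$.

Lemma 9. Let $\eta^* = \max\left\{\min_{i \neq j} |Q_i - Q_j|,\ c^{-1} 6\lambda X_{\max}\alpha_{\max}\mathcal{M}(\alpha_0) + c^{-1} 2\mu\lambda X_{\max}|\delta_0|_1\right\}$.
Suppose that Assumption 6 holds. Then conditional on the events $\mathbb{A}$ and $\mathbb{B}$,

$$|\hat\tau - \tau_0| \le \eta^*.$$

Remark 8. The nonasymptotic bound in Lemma 9 can be translated into the consistency
of $\hat\tau$, as in Theorem 3. That is, if $n \to \infty$, $M \to \infty$, and $\lambda\mathcal{M}(\alpha_0) \to 0$, Lemma 9 implies
the consistency of $\hat\tau$, provided that $X_{\max}$, $\alpha_{\max}$, and $c^{-1}$ are bounded uniformly in $n$.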

We now provide a lemma for bounding the prediction risk as well as the $\ell_1$ estimation
loss for $\alpha_0$. In particular, we state results without resorting to Assumption 5.1.

Lemma 10. Suppose that Assumption 1 with $(s, c_0) = \left(s, \frac{2+\mu}{1-\mu}\right)$ for some $\mathcal{M}(\alpha_0) \le s \le M$,
and Assumptions 3 and 5 hold. If $|\hat\tau - \tau_0| \le c_\tau$ for some $c_\tau$, then conditional on $\mathbb{A}$, $\mathbb{B}$ and
$\mathbb{C}(c_\tau)$, we have

$$\|\hat f - f_0\|_n^2 \le 3\lambda \left\{ \left[\sqrt{c_\tau} + (2X_{\min})^{-1} c_\tau C |\delta_0|_1\right] \vee \cdots \right\}.$$

As can be seen in the proof of Lemma 10, both the prediction risk and the $\ell_1$ estimation
loss for $\alpha_0$ can be small if $|\hat\tau - \tau_0|$ is small, even without Assumption 5.1. The following
lemma shows that the bound for $|\hat\tau - \tau_0|$ can be further tightened if we combine results
obtained in Lemmas 9 and 10.

Lemma 11. Suppose that $|\hat\tau - \tau_0| \le c_\tau$ and $|\hat\alpha - \alpha_0|_1 \le c_\alpha$ for some $(c_\tau, c_\alpha)$. Let $\tilde\eta :=
c^{-1}\lambda\left((1+\mu) X_{\max} c_\alpha + \sqrt{c_\tau} + (2X_{\min})^{-1} |\delta_0|_1 C c_\tau\right)$. If Assumption 6 holds, then conditional on
the events $\mathbb{A}$, $\mathbb{B}$, and $\mathbb{C}(c_\tau)$,

$$|\hat\tau - \tau_0| \le \tilde\eta.$$

Lemmas 9, 10, and 11 suggest that we may be able to develop a chaining argument to
obtain sharper bounds for the prediction risk and the $\ell_1$ estimation loss for $(\alpha_0, \tau_0)$, as we
demonstrate in the following theorem. Before we state our main theorem, we first make an
additional assumption. The following condition consists of inequality constraints on $\lambda$, $s$,
$|\delta_0|_1$, and other constants.

Assumption 7 (Inequality Conditions). The following inequalities hold:



(8.3) » > f , + 1 V + 1 " " ^ 4 ' ' - " 1 " 



(1 /i) ^T m i n y \ ^min 3 

(8 4) a > ^ ( 3Xmax I 1 

3cL2X max \ (1 — ^l) X m i n 

Remark 9. It would be easier to satisfy Assumption 7 when the sample size $n$ is large. To
appreciate Assumption 7 in a setup when $n$ is large, suppose that (1) $n \to \infty$, $M \to \infty$,
$s \to \infty$, and $\lambda \to 0$; (2) $|\delta_0|_1$ may or may not diverge to infinity; (3) $X_{\min}$, $X_{\max}$, $\alpha_{\max}$,
$\kappa$, $c$, $C$, and $L_2$ are independent of $n$. Then (8.2)-(8.4) can hold simultaneously for all
sufficiently large $n$, provided that $\lambda \to 0$.

We now give the main result of this section. 

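Theorem 12. Suppose that Assumption 1 with $(s, c_0) = \left(s, \frac{2+\mu}{1-\mu}\right)$ for some $\mathcal{M}(\alpha_0) \le s \le M$,
and Assumptions 3, 5, 6, and 7 hold. Let $U_i$ follow $N(0, \sigma^2)$ and $(\hat\alpha, \hat\tau)$ be the Lasso
estimator defined by (2.3) with

$$ \lambda = A\sigma \left( \frac{\log 3M}{n r_n} \right)^{1/2} $$

and $A > 2\sqrt{2}/\mu$. Then, there exists a sequence of constants $\eta_1, \ldots, \eta_{m^*}$ for some finite $m^*$
such that, with probability at least
$1 - (3M)^{1 - A^2\mu^2/8} - 4 \sum_{j=1}^{m^*} (3M)^{-A^2/(16 r_n h_n(\eta_j)^2)}$, we have

$$ \|\hat f - f_0\|_n \le \frac{3A\sigma X_{\max}}{\kappa} \left( L_2 \frac{\log 3M}{n r_n}\, s \right)^{1/2}, $$

$$ |\hat\alpha - \alpha_0|_1 \le \frac{9A\sigma L_2 X_{\max}^2}{(1-\mu) \kappa^2 X_{\min}} \left( \frac{\log 3M}{n r_n} \right)^{1/2} s, $$

and

$$ |\hat\tau - \tau_0| \le \left( \frac{X_{\max}}{X_{\min}} + \frac{1-\mu}{3} \right) \frac{9 L_2 X_{\max}^2 A^2 \sigma^2}{(1-\mu) \kappa^2 c}\, \frac{\log 3M}{n r_n}\, s. $$

If Assumption 4 holds in addition, then

$$ \mathcal{M}(\hat\alpha) \le \frac{36\, \phi_{\max} L_2 X_{\max}^2}{(1-\mu)^2 \kappa^2 X_{\min}^2}\, s. $$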

Theorem 12 gives the same inequalities (up to constants) as those in Theorems 6 and 7 for
the prediction risk as well as the $\ell_1$ estimation loss for $\alpha_0$. Note that Assumption 5.1 is not
needed to obtain Theorem 12. This is because we have used the result from the tight bound
for $|\hat\tau - \tau_0|$. It is important to note that $|\hat\tau - \tau_0|$ is bounded by a constant times $\lambda^2 s$,
whereas $|\hat\alpha - \alpha_0|_1$ is bounded by a constant times $\lambda s$. This can be viewed as a nonasymptotic
version of the super-consistency of $\hat\tau$ to $\tau_0$. One of the main contributions of this paper is
that we have extended the well-known super-consistency result of $\hat\tau$ when $M < n$ (see, e.g.,
Chan, 1993) to the high-dimensional setup ($M > n$).



9. Monte Carlo Experiments 

In this section we conduct some simulation studies and check the properties of the
proposed Lasso estimator. The baseline model is the following threshold regression:

$$ Y_i = X_i' \beta_0 + X_i' \delta_0 1\{Q_i < \tau_0\} + U_i, \quad i = 1, \ldots, n, $$

where $X_i$ is an $M$-dimensional vector generated from $N(0, I)$, $Q_i$ is a scalar generated
from the uniform distribution on the interval $(0, 1)$, and the error term $U_i$ is generated
from $N(0, 0.5^2)$. The threshold parameter is set to $\tau_0 = 0.3$, $0.4$, or $0.5$ depending
on the simulation design, and the coefficients are set to $\beta_0 = (1, 0, 1, 0, \ldots, 0)$ and
$\delta_0 = c \cdot (0, -1, 1, 0, \ldots, 0)$, where $c = 0$ or $1$. Note that there is no threshold effect when
$c = 0$. The number of observations is set to $n = 200$. Finally, the dimension of $X_i$ in each
design is set to $M = 50$, $100$, $200$, and $400$, so that the total number of regressors is 100,
200, 400, and 800, respectively. The range of $\tau$ is $T = [0.15, 0.85]$.
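
The baseline design can be sketched in a few lines of Python. This is only an illustration under the stated design, using the standard library; the function name `simulate_design` and its arguments are hypothetical and do not come from the paper.

```python
import random

def simulate_design(n=200, M=50, tau0=0.5, c=1.0, sigma=0.5, seed=0):
    """Draw one sample from the baseline threshold regression (illustrative; needs M >= 3)."""
    rng = random.Random(seed)
    # beta0 = (1, 0, 1, 0, ..., 0) and delta0 = c * (0, -1, 1, 0, ..., 0)
    beta0 = [1.0, 0.0, 1.0] + [0.0] * (M - 3)
    delta0 = [0.0, -c, c] + [0.0] * (M - 3)
    X, Q, Y = [], [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(M)]   # X_i ~ N(0, I)
        q = rng.random()                              # Q_i ~ Uniform(0, 1)
        u = rng.gauss(0.0, sigma)                     # U_i ~ N(0, sigma^2)
        y = sum(b * v for b, v in zip(beta0, x))
        if q < tau0:                                  # regime with the threshold effect
            y += sum(d * v for d, v in zip(delta0, x))
        X.append(x)
        Q.append(q)
        Y.append(y + u)
    return X, Q, Y, beta0, delta0

X, Q, Y, beta0, delta0 = simulate_design()
```

Setting `c=0.0` gives the design without a threshold effect, and changing `M` or `tau0` covers the other designs described above.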

We can estimate the parameters by the standard LASSO/LARS algorithm of Efron et al.
(2004) without much modification. Given a regularization parameter value $\lambda$, we estimate
the model for each grid point of $\tau$, where the grid consists of 71 equi-spaced points on $T$.
This step can be carried out with the standard linear Lasso. Next, we plug the
estimated parameter $\hat\alpha(\tau) := (\hat\beta(\tau)', \hat\delta(\tau)')'$ for each $\tau$ into the objective function and
choose $\hat\tau$ by

(9.1)  $\hat\tau := \arg\min_{\tau \in T \subset \mathbb{R}} \left\{ S_n(\hat\alpha(\tau), \tau) + \lambda \left| D(\tau) \hat\alpha(\tau) \right|_1 \right\}$

and $\hat\alpha := \hat\alpha(\hat\tau)$. The regularization parameter $\lambda$ is chosen by

(9.2)  $\lambda := A\sigma \left( \frac{\log(3M)}{n r_n} \right)^{1/2}$,

where $r_n = \min_j \|X^{(j)}(t_0)\|_n^2 / \|X^{(j)}\|_n^2$ and $\sigma = 0.5$ is assumed to be known. For the
constant $A$, we use four different values: $A = 2.8$, $3.2$, $3.6$, and $4.0$.
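
The grid-search procedure can be sketched as follows. This is a minimal illustration, not the authors' code: it swaps in plain coordinate descent for the LARS algorithm, assumes the penalty weights are $d_j = \|X^{(j)}(\tau)\|_n$ and that $S_n$ is the average squared residual, and uses hypothetical names (`soft`, `weighted_lasso`, `threshold_lasso`) and a small toy sample chosen only for speed.

```python
import math
import random

def soft(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def weighted_lasso(Z, y, lam, max_sweeps=500, tol=1e-8):
    """Coordinate descent for (1/n)|y - Z a|_2^2 + lam * sum_j d_j |a_j|,
    with d_j = ||Z^(j)||_n. Returns (coefficients, objective value)."""
    n, p = len(Z), len(Z[0])
    d2 = [sum(row[j] ** 2 for row in Z) / n for j in range(p)]  # d_j^2
    a = [0.0] * p
    r = list(y)                                   # residual y - Z a
    for _ in range(max_sweeps):
        biggest = 0.0
        for j in range(p):
            if d2[j] == 0.0:                      # empty column: keep a_j = 0
                continue
            rho = sum(Z[i][j] * r[i] for i in range(n)) / n + d2[j] * a[j]
            new = soft(rho, lam * math.sqrt(d2[j]) / 2.0) / d2[j]
            step = new - a[j]
            if step != 0.0:
                for i in range(n):
                    r[i] -= step * Z[i][j]
                a[j] = new
                biggest = max(biggest, abs(step))
        if biggest < tol:                         # stop once a full sweep barely moves
            break
    penalty = sum(math.sqrt(d2[j]) * abs(a[j]) for j in range(p))
    return a, sum(v * v for v in r) / n + lam * penalty

def threshold_lasso(X, Q, y, lam, grid):
    """Minimize the profiled objective (9.1) over the grid of tau values."""
    best = None
    for tau in grid:
        # X_i(tau) = (X_i', X_i' 1{Q_i < tau})'
        Z = [x + [v if q < tau else 0.0 for v in x] for x, q in zip(X, Q)]
        a, obj = weighted_lasso(Z, y, lam)
        if best is None or obj < best[0]:
            best = (obj, tau, a)
    return best[1], best[2]

# Small illustrative run (n = 80 and M = 3 are chosen for speed, not taken from the paper).
rng = random.Random(1)
n, M, tau0, sigma, A = 80, 3, 0.5, 0.5, 2.8
X = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(n)]
Q = [rng.random() for _ in range(n)]
y = [x[0] + x[2] + ((x[2] - x[1]) if q < tau0 else 0.0) + rng.gauss(0.0, sigma)
     for x, q in zip(X, Q)]
grid = [0.15 + 0.01 * k for k in range(71)]       # 71 equi-spaced points on [0.15, 0.85]
t0 = grid[0]
r_n = min(sum(x[j] ** 2 for x, q in zip(X, Q) if q < t0) / sum(x[j] ** 2 for x in X)
          for j in range(M))
lam = A * sigma * math.sqrt(math.log(3 * M) / (n * r_n))   # rule (9.2)
tau_hat, alpha_hat = threshold_lasso(X, Q, y, lam, grid)
```

Rescaling each column by its empirical norm would let any off-the-shelf Lasso solver replace `weighted_lasso`; the coordinate-descent version is used here only to keep the sketch free of external dependencies.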

Table 4 and Figures 1-2 summarize these simulation results. To benchmark the performance
of the Lasso estimator, we also report the estimation results of least squares
(Least Squares), available only when $M = 50$, and of two oracle models (Oracle 1 and Oracle
2). Oracle 1 assumes that the regressors with non-zero coefficients are known.
In addition to that, Oracle 2 assumes that the true threshold parameter $\tau_0$ is known. Thus,
when $c \neq 0$, Oracle 1 estimates $(\beta^{(1)}, \beta^{(3)}, \delta^{(2)}, \delta^{(3)})$ and $\tau$ using least squares, while
Oracle 2 estimates only $(\beta^{(1)}, \beta^{(3)}, \delta^{(2)}, \delta^{(3)})$. When $c = 0$, both Oracle 1 and Oracle 2
estimate only $(\beta^{(1)}, \beta^{(3)})$. All results are based on 400 replications of each sample.

The reported mean-squared prediction error (PE) for each sample is calculated numerically
as follows. For each sample $s$, we have the estimates $\hat\beta_s$, $\hat\delta_s$, and $\hat\tau_s$. Given these
estimates, we generate new data $\{Y_j, X_j, Q_j\}$ of 400 observations and calculate the
prediction error as

(9.3)  $PE_s = \frac{1}{400} \sum_{j=1}^{400} \left( f(x_j, q_j; \beta_0, \delta_0, \tau_0) - f(x_j, q_j; \hat\beta_s, \hat\delta_s, \hat\tau_s) \right)^2$,

where $f(x, q; \beta, \delta, \tau) = x'\beta + x'\delta 1\{q < \tau\}$. The mean, median, and standard deviation of
the prediction errors are calculated from the 400 replications, $\{PE_s\}_{s=1}^{400}$. In Table 4, we also
report the mean of $\mathcal{M}(\hat\alpha)$ and the $\ell_1$-errors for $\alpha$ and $\tau$ when $M = 50$. For simulation designs with
$M > 50$, Least Squares is not available. Figures 1-2 report similar statistics only for
the Lasso estimators.
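
The prediction-error calculation in (9.3) can be written out as a short sketch. The helper names `f` and `prediction_error` are hypothetical, the new regressors are drawn as in the baseline design ($X_j \sim N(0, I)$, $Q_j \sim U(0,1)$), and only the standard library is assumed.

```python
import random

def f(x, q, beta, delta, tau):
    """Regression function f(x, q; beta, delta, tau) = x'beta + x'delta * 1{q < tau}."""
    val = sum(b * v for b, v in zip(beta, x))
    if q < tau:
        val += sum(d * v for d, v in zip(delta, x))
    return val

def prediction_error(true_par, est_par, M, n_new=400, seed=0):
    """Average squared gap between true and fitted regression functions, as in (9.3),
    evaluated on n_new fresh draws of (x, q)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_new):
        x = [rng.gauss(0.0, 1.0) for _ in range(M)]
        q = rng.random()
        total += (f(x, q, *true_par) - f(x, q, *est_par)) ** 2
    return total / n_new

truth = ([1.0, 0.0, 1.0], [0.0, -1.0, 1.0], 0.5)
shifted_tau = ([1.0, 0.0, 1.0], [0.0, -1.0, 1.0], 0.4)   # only the threshold is off
pe = prediction_error(truth, shifted_tau, M=3)
```

Because (9.3) compares regression functions rather than outcomes, the new $Y_j$ values are not needed in this calculation.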

First, the proposed Lasso estimator performs better than Least Squares in all designs.
The gap is even more evident when there is no threshold effect, i.e. $c = 0$, which shows
that the Lasso estimator is robust to whether or not a threshold effect exists. Figures 1-2
reconfirm this robustness when $M = 100$, $200$, and $400$. Second, as
predicted by the theory developed in previous sections, the prediction errors and $\ell_1$ errors
for $\alpha$ and $\tau$ increase slowly as $M$ increases. The graphs also show that the results are quite
uniform across different regularization parameter values except $A = 4.0$.

We next consider different simulation designs. The $M$-dimensional vector $X_i$ is now
generated from a multivariate normal $N(0, \Sigma)$ with $(\Sigma)_{i,j} = \rho^{|i-j|}$, where $(\Sigma)_{i,j}$ denotes the
$(i,j)$ element of the $M \times M$ covariance matrix $\Sigma$. All other random variables are the same
as above. We conducted the simulation studies for both $\rho = 0.1$ and $0.3$; however, Table 5
and Figures 3-4 only report the results of $\rho = 0.3$ to save space (the results with $\rho = 0.1$
are similar). They show very similar results to the previous cases: Lasso outperforms Least
Squares, and the prediction error, $\mathcal{M}(\hat\alpha)$, and $\ell_1$-errors increase very slowly as $M$ increases.
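
A correlated regressor vector of this kind can be drawn without forming the covariance matrix. The sketch below assumes the covariance structure is $(\Sigma)_{i,j} = \rho^{|i-j|}$ and uses the first-order autoregressive recursion, which has exactly that covariance; the function name `draw_ar1_row` is hypothetical.

```python
import math
import random

def draw_ar1_row(M, rho, rng):
    """Draw x ~ N(0, Sigma) with Sigma[i][j] = rho ** |i - j| via an AR(1) recursion."""
    x = [rng.gauss(0.0, 1.0)]                     # first coordinate has unit variance
    scale = math.sqrt(1.0 - rho * rho)            # keeps every coordinate at unit variance
    for _ in range(M - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x

rng = random.Random(0)
rows = [draw_ar1_row(M=50, rho=0.3, rng=rng) for _ in range(200)]
```

Each coordinate is standard normal and coordinates $k$ positions apart have correlation $\rho^k$, so the recursion matches the stated design while costing only $O(M)$ operations per draw.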

Figure 5 shows frequencies of selecting true parameters when $\rho = 0$ and $\rho = 0.3$.
When $\rho = 0$, the probability that the Lasso estimates include the true nonzero parameters
is very high. In most cases, the probability is 100%, and even the lowest probability is as
high as 98.25%. When $\rho = 0.3$, the corresponding probability is somewhat lower than in the
no-correlation case, but it is still high and the lowest value is 80.75%.

In sum, the simulation results confirm the theoretical results developed earlier and 
show that the proposed Lasso estimator will be useful for the threshold model with
high-dimensional regressors.

10. Conclusions

We have considered a high-dimensional regression model with a possible change-point 
due to a covariate threshold and have developed the Lasso method. We have derived 
nonasymptotic oracle inequalities and have illustrated the usefulness of our proposed
estimation method via simulations and a real-data application. It would be an interesting
future research topic to extend the adaptive Lasso of Zou (2006) to our setup and to see 
whether we would be able to improve the performance of our estimation method. 



Appendix A. Proofs 



Proof of Lemma 1 . Note that 

(A.l) S n + \ DS 

for all (a, r) £ R 2M x T. Now write 
S n - S n (a, t) 



< S n (a, r) + A |D(r)a| 1 



n 1 |y — X(r)S?|2 — n 1 |y — X(r)a|2 

n n 

1 E P* ~ { X ^)' S " Xifo/ao}] 2 " n ^ E P* ~ { X i( T )' Q " Xi(To)'ao}] : 

i=l i=l 
n n 

_1 E { X i(^)'« ~ X^ro)'^} - n" 1 ^ {Xi(r)'a - Xj(To)'a } 
i=i 

n 

2n_1 E ^ { X *( f )' a " X i( T )'«} 



n 



i=i 



i=i 

1 1 2 



/ - /o - ||/(a,T) - /o|| n 



u 
n 



2n 



1 Y, UiX[0 - P) - 2n~ l Ui [x'MQi < r) - X[S\{Qi < r)} . 



i=l i=l 

Further, write the last term above as 



n 

1 ^ tr< { A^l(Q, < f ) - X#l(Qi < r)} 
i=i 

n n 

1 Y,UiX' l (5-5)l(Qi<?) + n- 1 Y,UiX' i 5{l(Q i <?)-l(Qi<T)} 



i=l 



8=1 
Hence, (A.1) can be written as



f-fo 



< \\f(a,r) ~ fo\\l + A iD^a^ - A Da 

n n 

+ In' 1 ^ UiX'tf-p) + 2n _1 UiXi(S- <5)1(Q* < ?) 



i=l 
n 



+ 2U- 1 UiXtfiUQi < t) - l(Qi < r)} . 



i=l 

Then on the events A and B, we have 



(A.2) 



f-fo < ||/(o,r) - /o|| + ^ D(S - a) 



+ A|D(r)a| 1 -A Da ^ + R n (a,r) 



for all $(\alpha, \tau) \in \mathbb{R}^{2M} \times T$.
Note the fact that

(A.3) 







a 



a 



U) 



for j $ Jo- 



On the one hand, by (A.2) (evaluating at $(\alpha, \tau) = (\alpha_0, \tau_0)$), on the events A and B,

2 „ ^ 

+ (1 — //) A D(S — «o) 



/-/o 
< A 
+ A 



D(a — cto) 



+ 



Da 



Da 



Da 



Da |i +^n(«o,To) 



< 2A 



D(a - a ) Jo 



+ A 



Da 



|Da |i + i? n (ao,T"o), 



which proves (4.1). On the other hand, again by (A.2) (evaluating at $(\alpha, \tau) = (\alpha_0, \hat\tau)$), on
the events A and B,

2 



f-fo 

< A 

< 2A 

which proves (4.2). 



+ (1 — fi) A D(a — ao) 



D(a — ao) 



+ 



Da 



i 

Da 



J + \\f(a ,r) ~ fo\\ n 



D(a - a )j i + ||/( ao ,T) ~ /o|| n 



□ 



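Proof of Lemma 2. Since $U_i \sim N(0, \sigma^2)$,

$$ P\{A^c\} \le \sum_{j=1}^{M} P\left\{ \sqrt{n}\, |V_{1j}| > \mu \sqrt{n} \lambda / (2\sigma) \right\} = 2M \Phi\left( -\frac{\mu \sqrt{n}}{2\sigma} \lambda \right) \le 2M \Phi\left( -\frac{\mu \sqrt{r_n n}}{2\sigma} \lambda \right), $$

where the last inequality follows from $r_n \le 1$.

Now consider the event B. Note that $\|X^{(j)}(\tau)\|_n$ is monotonically increasing in $\tau$ and
$\sum_{i=1}^{n} U_i X_i^{(j)} 1\{Q_i < \tau\}$ can be rewritten as a partial sum process by the rearrangement of $i$
according to the magnitude of $Q_i$. To simplify notation, we assume without loss of generality
that $Q_i = i/n$. Then, by Lévy's inequality (see e.g. Proposition A.1.2 of van der Vaart and
Wellner, 1996),

$$ P\left\{ \sup_{\tau \in T} \sqrt{n}\, |V_{2j}(\tau)| > \frac{\mu \sqrt{n} \lambda}{2\sigma} \right\} \le P\left\{ \sup_{1 \le s \le n} \left| \frac{1}{\sigma \sqrt{n}} \sum_{i=1}^{s} U_i X_i^{(j)} \right| > \|X^{(j)}(t_0)\|_n \frac{\mu \sqrt{n} \lambda}{2\sigma} \right\} \le 2 P\left\{ \sqrt{n}\, |V_{1j}| > \frac{\|X^{(j)}(t_0)\|_n}{\|X^{(j)}\|_n} \frac{\mu \sqrt{n} \lambda}{2\sigma} \right\}. $$

Therefore, we have

$$ P\{B^c\} \le \sum_{j=1}^{M} P\left\{ \sup_{\tau \in T} \sqrt{n}\, |V_{2j}(\tau)| > \mu \sqrt{n} \lambda / (2\sigma) \right\} \le 4M \Phi\left( -\frac{\mu \sqrt{r_n n}}{2\sigma} \lambda \right). $$

Since $P\{A \cap B\} \ge 1 - P\{A^c\} - P\{B^c\}$, we have proved the lemma. □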



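Proof of Theorem 3. Note that

$$ R_n = 2n^{-1} \sum_{i=1}^{n} U_i X_i' \delta_0 \left\{ 1(Q_i < \hat\tau) - 1(Q_i < \tau_0) \right\}. $$

Then on the event B,

(A.4)  $|R_n| \le 2\mu\lambda \sum_{j=1}^{M} \|X^{(j)}\|_n |\delta_0^{(j)}| \le 2\mu\lambda X_{\max} |\delta_0|_1$.

Then, conditional on $A \cap B$, combining (A.4) with (4.1) gives

(A.5)  $\|\hat f - f_0\|_n^2 + (1-\mu)\lambda |D(\hat\tau)(\hat\alpha - \alpha_0)|_1 \le 6\lambda X_{\max} \alpha_{\max} \mathcal{M}(\alpha_0) + 2\mu\lambda X_{\max} |\delta_0|_1$,

since

$$ |D(\hat\tau)(\hat\alpha - \alpha_0)_{J_0}|_1 \le 2 X_{\max} \alpha_{\max} \mathcal{M}(\alpha_0), \qquad \left| |D(\hat\tau)\alpha_0|_1 - |D(\tau_0)\alpha_0|_1 \right| \le 2 X_{\max} |\alpha_0|_1 \le 2 X_{\max} \alpha_{\max} \mathcal{M}(\alpha_0). $$

Using the bound that $2\Phi(-x) \le \exp(-x^2/2)$ for $x > 0$ as in equation (B.4) of Bickel et al.
(2009), Lemma 2 with $\lambda$ given by (4.3) implies that the event $A \cap B$ occurs with probability
at least $1 - (3M)^{1 - A^2\mu^2/8}$. Then the theorem follows from (A.5). □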
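
The last step of this proof can be checked numerically. The sketch below evaluates the Lemma 2 lower bound $1 - 6M\Phi(-\mu\sqrt{r_n n}\,\lambda/(2\sigma))$ with $\lambda = A\sigma(\log 3M/(n r_n))^{1/2}$ and compares it with the simpler bound $1 - (3M)^{1 - A^2\mu^2/8}$. The function names and the parameter values in the example call are illustrative assumptions, not values taken from the paper.

```python
import math

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def coverage_bounds(M, n, r_n, sigma, A, mu):
    """Return (Lemma 2 lower bound, simplified bound) for P(A and B)."""
    lam = A * sigma * math.sqrt(math.log(3 * M) / (n * r_n))
    x = mu * math.sqrt(r_n * n) * lam / (2.0 * sigma)   # equals A * mu * sqrt(log 3M) / 2
    lemma2 = 1.0 - 6.0 * M * Phi(-x)
    simple = 1.0 - (3.0 * M) ** (1.0 - A ** 2 * mu ** 2 / 8.0)
    return lemma2, simple

# Hypothetical inputs: mu = 0.9 and r_n = 0.15 are picked for illustration only.
lemma2, simple = coverage_bounds(M=50, n=200, r_n=0.15, sigma=0.5, A=4.0, mu=0.9)
```

Since $2\Phi(-x) \le \exp(-x^2/2)$, the first returned value is never smaller than the second; the simplified bound is informative only when $A > 2\sqrt{2}/\mu$, which makes the exponent of $3M$ negative.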
Proof of Lemma 4. Combining (4.2) with (5.1) in Assumption 2 or combining (4.1) with
(5.2) in Assumption 2 yields

2 



(A.6) 



f-fo 



+ (1 - n) A D(3 - ao) < (2 + Ii) A D(S - a )j 



which implies that 



D(a- a )jc 



< 



1 + /i + Li 
l-/i 



D(a - a )j 



i 



This in turn allows us to apply Assumption 1, specifically RE(s, 1+ 1 / ^ t Z ' 1 ), to yield 



^ 21^ 

D(S - q )j < -|X(r )D(S - a )|| 
2 n 



1 



(A.7) 



n 



(a - a ) / DX(ro) / X(r )D(a - a ) 



< max ^ D ^ (a _ ao) / X(To) / X(r )(S - q ) 



n 



= max(D) 2 \\f( S ,T ) ~ fo\\ 2 n 

where k = k(s, 1+ 1 A ^ t Ll ). 

Combining (A.6) with (A.7) yields 

f-fo 2 < (2 + L\) A r3(a-Q )j 

n 

< (2 + L 1 )A^|d(3-qo)j 
(2 + Li)A 



< 



Vsmax(D) ||/( S>ro ) - /o|| n 

7-/o 



< ( 2 ±MV^^ max(6 ) 

K 



where the last inequality follows from Assumption 3. Then the first conclusion of the lemma
follows immediately.

In addition, combining the arguments above with the first conclusion of the lemma yields



D (S — ao) 



D(q - a ] 



Jo 



+ 



D(S — «o) 



< {2 + L l ){l-^y 1 D(S-ao) 



•A) 



< (2 + Li)(l-nY 



D(3 - a ) 



Jo 



2 + L x 

< — ; r max 



< 



k(1 - fi) 
k(1-h) 



D ) 1 1 /(5,7b) - fo\\ n Vs 



max(D) 



f-fo 



(2 + L!) J AL 2 2 
" (I-ij)k 2 max ' 

which proves the second conclusion of the lemma since 



(A.8) 



D (a — ao) > min(D) | (a — ao)li • 



□ 



Proof of Lemma 5. As in (B.6) of Bickel et al. (2009), for each r, the necessary and sufficient 
condition for 3(r) to be the Lasso solution can be written in the form 
2, 



n 



[l«]'(y-X(r)a(r)) = A 



[X«]'(y-X(r)«(r)) 



n 



< A 



[X«(r)]'(y-X(r)a(r)) = A 



n 



[X»(r)]'(y-X(r)«(r)) 



< A 



sign(/3^(r)) 



sig 



n(?^(r)) 



where j = 1, . . . , M. 

Note that conditional on events A and 



n 



i=l 



X (i) 



^C^f i (, ' ) l{Q i <r}<MA X^(r) 



n 



i=l 



if ^O')( r )/0 

if j gW)( r )= 

if ?0')(r)/ 
if ?0')( r )= 0, 
for any $\tau$, where $j = 1, \ldots, M$. Therefore

2 

n 



-[xW] / (X(T )ao-X(r)3(r)) 



> (1 - y) A 



X U) 



n 



[xW(r)]'(X(To)ao-X(r)S(r)) 



> (1-m)A I (3) (r) 



if /?0')(r)/0 
if ?0')( r )/ 0. 



Using inequalities above, write 
1 



n- 



[X(r )a - X(f)S] / X(f)X(r) / [X(r )a - X(f)S] 



M 2 i M 

- 2 £ {[X^] f [X(r )a - X(f)S]} + £ {[I (j) (f)f[X(r ) tt „ - X(f)a]} 
i=i i=i 



> 



7) 



> 



> 



,2 X] 

(1-/^) 2 A 2 
4 

(l-/^) 2 A 2 v2 



'[X(T )ao-X(f)S]} 2 + ^ J] {[I«(f)]'[X(r )a -X(f)a][ 



j:<5OV0 
2 



Vj:/3U)^0 



^min-^ («) ■ 



To complete the proof, note that 



[X(r )a - X(r)S] / X(r)X(r) / [X(r )a - X(f)a] 



< maxeig(X(r)X(r)'/n) 

2 



/-/o 



< 



f-fo 



where maxeig(X(r)X(r) / /n) denotes the largest eigenvalue of X(r)X(r) / /n. 



□ 



Proof of Theorem 6. Since $\delta_0 = 0$, Assumption 2 holds with $L_1 = 0$. Then the theorem follows
by combining Lemmas 4 and 5 with the bound on $P(A \cap B)$ as in the proof of Theorem
3. □

Proof of Theorem 7. First of all, recall that the bound on $P(A \cap B)$ can be obtained as in the
proof of Theorem 3. We consider the following two cases: (i) (5.1) holds with $L_1 = L_1^*$, and (ii)
it does not hold.

Case (i). If (5.1) holds with $L_1 = L_1^*$, then the conditions of Lemma 4 are satisfied. In
this case, the theorem can be proved as in the proof of Theorem 6 by combining Lemmas
4 and 5.

Case (ii). Thus, it remains to consider the case that (5.1) does not hold, that is,



(A.9) L\\ 
First, note that the fact that 



D(5 - a )j ^ < \\f(a ,T) ~ fo\\ n ■ 



\\f(a ,r) ~ /o||' = - ( X ^) 2 \HQi< ^} " 1 {Q l < T}\ 

(A.io) n n jri 

<dln- 2 »CM {5 Q )\t x -t Q \, 

where the last inequality follows from (7.1) and Assumption 5 with rj = \t± — to\- 
Now combining (4.2), (A.9), and (A.IO) together yields that 

f-fo 



" ( L* +1 ) ~/°lln 



< ( + 1 ) Cdln-^M^lto-h 



Cdln 2v M (<5 )|to-ti| 



C* 



= {2 + L1) \ XL * L2 \ 2 M(a ), 

K 

where the last two equalities follow from the construction of L\. This proves the first 
conclusion of the theorem. 

To obtain the bound on \a — olq\i, note that again using (4.2), (A.9), and (A.IO) and 
repeating the same arguments as above, we have 

1 ( 2 



D (a — ao) 



< 



i- (l-/i)A \L\ 
(2 + L\) 2 A^ ax L 2 



< 



~ 1 ) ||/(a ,f) ~~ /o| 

AA4(a ), 



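which, combined with (A.8), proves the second conclusion of the theorem. The third
conclusion follows immediately from Lemma 5. □

Proof of Lemma 8. Given Lemma 2, it remains to examine the probability of $C(\eta_j)$. As in
the proof of Lemma 2, Lévy's inequality yields that

$$ P\{C(\eta_j)^c\} \le P\left\{ \sup_{|\tau - \tau_0| < \eta_j} \left| \frac{2}{n} \sum_{i=1}^{n} U_i X_i' \delta_0 \left[ 1(Q_i < \tau_0) - 1(Q_i < \tau) \right] \right| > \lambda \sqrt{\eta_j} \right\} $$
$$ \le 2 P\left\{ \left| \frac{2}{n} \sum_{i=[n(\tau_0 - \eta_j)]}^{[n(\tau_0 + \eta_j)]} U_i X_i' \delta_0 \right| > \lambda \sqrt{\eta_j} \right\} \le 4 \Phi\left( -\frac{\lambda \sqrt{n}}{2\sqrt{2}\, \sigma\, h_n(\eta_j)} \right). $$

Hence, we have proved the lemma since
$P\{A \cap B \cap [\cap_{j=1}^{m} C(\eta_j)]\} \ge 1 - P\{A^c\} - P\{B^c\} - \sum_{j=1}^{m} P\{C(\eta_j)^c\}$. □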



Proof of Lemma 9. As in the proof of Lemma 1, we have, on the events A and B, 
S n - S n (a , r ) 



(A.ll) 




7 


-fo 




> 


7 


-fo 



2n- 1 UiXi(f3 - /3b) - In" 1 UiX'^S - S )l(Qi <?)- R r , 



»=i 



Rn 



Then using (A. 3), on the events A and B, 



Da 



[S n (a ,T ) + A (DaolJ 



(A.12) 



> 


7- 


/o 


> 


7- 


fo 


> 


7- 


fo 



/j,X 



A 



2A 



A 



Dq |i - 
Daoli 



DS 



i 

Da 



D(a — ao) 
D(a - a ) j 
[6AX max a max A4(a ) + 2/xAA" 

max |"UliJ i 

where the last inequality follows from (A. 4) and the following bounds: 

2A 



Rn 



D(S - a ) j < 4X 

max 

"max AX (a ) , 



A 



|Da |i 



Da 



< 2A max a max A/M (ao) • 



Suppose now that \f — tq\ > rfe. Then Assumption 6 and (A.12) together imply that 



Sn + A 



Da 



[S n (a , r ) + A |Dao|i] > °> 



which leads to a contradiction as $\hat\tau$ is the minimizer of the criterion function as in (2.3).
Therefore, we have proved the lemma. □

Proof of Lemma 10. Recall the basic inequality in (4.1):

~ 2 

f-fo 



+ (1 — /j,) A D(2 — ao) 



< 2A 



D(S — ao) 



Jo 



+ A 



Da 



|Daoli 



+ Rn- 



Note that on C, 



\Rn\ 



2n~ l UiXiSo {l(Qi < ?) - l(Qi < r )} 



t=i 



and due to the mean value theorem (applied to / (x) = y/x) and Assumption 5, 
(A.13) 



Da 



|D«o|i 



EF 



,(./) 



<j2l\\ x ^ (j) 1 *o C0 |^Ek w2 iiw<<n-iWi< < n)}i 

. ^ II n I ii I 

j=l 1=1 



We now consider two cases: (i) D(S — ao)j > -y/cv + c T C2 1 X min | <5q | x and case (ii 



D(2 - a )j i < ^c~ T + c T C2- 1 X mi 1 n 1^. 

Case (i): In this case, note that (5.2) in Assumption 2 holds with L\ = 1. Thus, we can 
repeat the proof of Lemma 4 with L\ = 1, which gives that on on A and B, 

9L 2 X* 



| a — aoli 



< 



f-fo 



[1 - fl) K 2 X m 
2 - 9 ^max-^2 ^2 



-Xs, 



< 



Case (ii): If D(S — ao)j < \fc^ + c r C2 X min \5q\ 1 , then it follows from (4.1) that 



f-fo 



<3A [^ + c T C2- 1 X^ n \5 \ 1 ] 



D(S-ao) x < t^- [v^ + cr^- 1 ^ |*olij ■ 



which implies that 



I a — aoli < 



(1 — n)X m m 
Therefore, we have proved the lemma. 



[^ T + c T C2- l X^ n \5,\ l ]. 



□

Proof of Lemma 11. Note that on A, B and C,



n 

[utx'i [p - A)) + uaii (Qi <f)(S- do 



i=l 



< /xAA~ max | a - a |i < MAX max c 



and 



n 



UtXfo [1 (Qi < f ) - 1 (Qi < r )] 



Suppose fj < \f — ro | < c T . Then, as in (A. 11), 



(A.14) 5 n -5„(a ,ro) > 

Furthermore, due to (A. 13), we obtain 



f-fo 



/iAA max c a — A-y/cv. 



S n + A 



Da 



[5 n (a ,To) + A IDqqIJ 



> C7) - (^X max c Q + y^) A - C2 X miQ c T \5 \ 1 \. 

Thus, since cfj = \ X max c a + y/c^ + (2X m i n ) -1 | <5o | x CcrJ A, we again use the contradiction 
argument as in the proof of Lemma 9 to prove the lemma. □ 



Proof of Theorem 12. Here we use the chaining argument by iteratively applying Lemmas
10 and 11 to tighten the bounds for the prediction risk and the estimation errors in $\hat\alpha$ and
$\hat\tau$.

In view of Lemma 9, we first start with 

c T . — c 8A max Q! max As. 

Suppose that 

( 5) (l-/*)*ndn - (l-»)K 2 Xn ^ ^ ' 



which in turn implies from Lemma 10 that \a — ao\ 1 and f — fc 
the theorem given the choice of A. Then Lemma 11 with c a = Ca yields that 
|t - r 1 < c7 x \X maK c a + ^fc~ T + (2X min )~ 1 |^olx Cc r) A 



achieve the bounds in 



< / Af max ^ 1 - /A 9L 2 A^ ax ^ ^ 



X min 3 /c(1-|i)k 2 
Thus, it remains to show that there is convergence in the iterated applications of Lemmas
10 and 11 toward the desired bound when (A.15) does not hold.

When (A.15) does not hold for $c_\tau^{(m)}$ for some $m$, where the superscript $m$ indicates the
$m$-th iteration, Lemmas 10 and 11 imply that



Jm+1) 



V +c T A^ldoli 



(l-/i)X n 



and 



4 m+1) = c- 1 (x maxC (T +1 ) + ^ + (2X min )- 1 ItfoK C4 m )) A. 



Whenever < , we can stop the iteration and the desired bound is achieved as 
discussed in (A15). Hence, it suffices to derive the fixed point when we start with the 
initial m such that ci" 1 ^ > c$ . 

Next, we derive the fixed point as follows. First, suppose that of > (C2 _1 X~ i 1 n |5oli) 2 ■ 
Recall that |<5o|i 7^ due to Assumption 6. Then we have 



c. 



(m+l) 



3 C r'c\ 



mm 

and thus 



(m+l) _ ^ o^max 1 \g I A (m) 

T " C V(l-^)^mm ^inJ' T ' 

which is strictly less than under (8.2) . It implies that converges to zero as 
the iteration continues. Therefore, there exists a sufficiently large m such that c^ 1 < 
{C2- l X m l n |5 |i)~ 2 holds for all m > m. 

Next assume that < (C2 _1 X mi 1 n |<5qIi) 2 ■ In this case, we have 



Jm+l) = 6 V C ^ 
~(l-/i)*ir 

and 



(A.16) 4 m+1 > = 2c' 1 ( - 3X °^ + 1 ) AV4 m) - 

\(1 — /i) -A mm / 

Recall that we are considering the case that when (A. 15) does not hold, so that < 
c^^-^IH)- 2 - Let 

(A.17) 4°°>=4c- 2 f 7 3Xma " +iVa 2 . 



:i-/x)x n 

As long as cf^ < cf (which is true under (8.3)), the fixed point of (A.16) is cf since the 
intial ci" 1 ^ starts from the right-hand side of the fixed point and converges to cf \ Recall 

that we can stop the iteration as soon as < . Thus, the iteration continues only a 

(i) 

finite number of times because $c_\tau^{(m)}$ is strictly decreasing. Each application of Lemmas 10
and 11 in the chaining argument requires conditioning on $C(\eta_j)$, $j = 1, \ldots, m^*$, for some
finite $m^*$.

Furthermore, (A. 17) implies that 

Joo) = 12C-1 ( 3X max \ 
° ~ (l-M)^min V(l-M)^min ) ' 

Note that Cq°°^ < c$ under (8.4). Therefore, for each case, we have shown that \a — ao\ 1 < 
Ca ^ and |f — ro | < cf . The bound for the prediction risk can be obtained similarly, and 
then the bound for the sparsity of the Lasso estimator follows from Lemma 5. □ 
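
The fixed-point step of the chaining argument can be illustrated numerically. Reading (A.16) as a recursion of the form $c^{(m+1)} = K \sqrt{c^{(m)}}$, with $K$ collecting the constants multiplying $\lambda$, the fixed point is $K^2$, which has the form of (A.17). The sketch below iterates this map from a starting value above the fixed point; the function name `iterate_bound` and the numerical values of `K` and `c0` are hypothetical.

```python
import math

def iterate_bound(c0, K, tol=1e-12, max_iter=200):
    """Iterate c_{m+1} = K * sqrt(c_m) and return the whole path of bounds."""
    path = [c0]
    for _ in range(max_iter):
        nxt = K * math.sqrt(path[-1])
        path.append(nxt)
        if abs(nxt - path[-2]) < tol:             # successive bounds no longer change
            break
    return path

K = 0.05                                          # hypothetical constant, proportional to lambda
path = iterate_bound(c0=0.5, K=K)
fixed_point = K ** 2                              # solves c = K * sqrt(c)
```

Starting to the right of the fixed point, the sequence decreases monotonically toward $K^2$, and the gap $\log(c^{(m)}/K^2)$ halves at every step, which is why the bound on $|\hat\tau - \tau_0|$ ends up of order $\lambda^2$ rather than $\lambda$.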
Table 1. List of Variables

Variable Names     Description

Dependent Variable
gr                 Annualized GDP growth rate in the period of 1960-85

Threshold Variables
gdp60              Real GDP per capita in 1960 (1985 price)
lr                 Adult literacy rate in 1960

Covariates
lgdp60             Log GDP per capita in 1960 (1985 price)
lr                 Adult literacy rate in 1960 (only included when Q = lr)
ls_k               Log(Investment/Output) annualized over 1960-85; a proxy for the log physical savings rate
lgr_pop            Log population growth rate annualized over 1960-85
pyrm60             Log average years of primary schooling in the male population in 1960
pyrf60             Log average years of primary schooling in the female population in 1960
syrm60             Log average years of secondary schooling in the male population in 1960
syrf60             Log average years of secondary schooling in the female population in 1960
hyrm60             Log average years of higher schooling in the male population in 1960
hyrf60             Log average years of higher schooling in the female population in 1960
nom60              Percentage of no schooling in the male population in 1960
nof60              Percentage of no schooling in the female population in 1960
prim60             Percentage of primary schooling attained in the male population in 1960
prif60             Percentage of primary schooling attained in the female population in 1960
pricm60            Percentage of primary schooling complete in the male population in 1960
pricf60            Percentage of primary schooling complete in the female population in 1960
secm60             Percentage of secondary schooling attained in the male population in 1960
secf60             Percentage of secondary schooling attained in the female population in 1960
seccm60            Percentage of secondary schooling complete in the male population in 1960
seccf60            Percentage of secondary schooling complete in the female population in 1960
llife              Log of life expectancy at age 0 averaged over 1960-1985
lfert              Log of fertility rate (children per woman) averaged over 1960-1985
edu/gdp            Government expenditure on education per GDP averaged over 1960-85
gcon/gdp           Government consumption expenditure net of defence and education per GDP averaged over 1960-85
revol              The number of revolutions per year over 1960-84
revcoup            The number of revolutions and coups per year over 1960-84
wardum             Dummy for countries that participated in at least one external war over 1960-84
wartime            The fraction of time over 1960-85 involved in external war
lbmp               Log(1 + black market premium averaged over 1960-85)
tot                The term of trade shock
lgdp60 x 'educ'    Product of two covariates (interaction of lgdp60 and education variables from pyrm60 to seccf60); total 16 variables
Table 2. Model Selection and Estimation Results with Q = gdp60





Linear Model 


Threshold Model 
? = 2898 






P 




X 



const. 


0.0232 


n fl9*}9 






Igdp60 


-0.0153 


n m 9n 








0.0033 


U.UUoo 






lqr„„„ 

j pop 

vvrf60 


0.0018 
0.0027 








syrm60 


0.0157 








hyrm60 


0.0122 


n m 

U.Ul ou 






hvrf60 


-0.0389 






-U.UoU I 


nom60 








Z.D4 X iu 


prim60 


-0.0004 


n nnm 

-U.UUUl 






pricm60 


0.0006 


— i.(o x iu 




-u.oo x iu 


pricfSO 


-0.0006 








sec] 60 


0.0005 








seccm60 


0.0010 






n nm a 


llife 


0.0697 


n 0^9*? 

u.uozo 






Ifert 


-0.0136 








edu/qdp 


-0.0189 








qcon/qdp 


-0.0671 








revol 


-0.0588 








revcoup 


0.0433 








wardum 


-0.0043 






-0 0022 


wartime 


-0.0019 


n m a*? 




-0.0023 


Ibmp 


-0.0185 


n m 7 a 

-U.Ul t 4 




-0.0015 


tot 


0.0971 






0.0974 


lqdp60 x vyrf60 










IgdpSO x syrm60 








0.0002 


lqdr>60 x hvrm60 








0.0050 


lgdp60 x hyrf60 


_ 


-0.0003 






lgdp60 x nom60 








8.26 x 10~ 6 


lgdp60 x prim60 


-6.02 x 10~ 7 








lgdp60 x prif60 


-3.47 x 10~ 6 






-8.11 x 10~ 6 


lgdp60 x pricf60 


-8.46 x 10~ 6 








lgdp60 x secm60 


-0.0001 








lgdp60 x seccf60 


-0.0002 


-2.87 x 10~ 6 






X 


0.0004 




0.0034 


M{a) 


28 




26 




of covariates 


46 




92 




# o/ obsesrvations 


80 




80 




R 2 


0.85 




0.80 





Note: The regularization parameter $\lambda$ is chosen by the 'leave-one-out' least squares
cross validation method. $\mathcal{M}(\hat\alpha)$ denotes the number of covariates selected by the
Lasso, and '-' indicates that the regressor is not selected. Recall that $\beta$ is the
coefficient when $Q \ge \hat\tau$ and that $\delta$ is the change of the coefficient value when $Q < \hat\tau$.
Table 3. Model Selection and Estimation Results with Q = lr

                      Linear Model        Threshold Model ($\hat\tau = 82$)
                                          $\hat\beta$             $\hat\delta$
const.                0.0224              0.0224              -
lgdp60                -0.0159             -0.0099             -
ls_k                  0.0038              0.0046              -
syrm60                0.0069              -                   -
hyrm60                0.0188              0.0101              -
prim60                -0.0001             -0.0001             -
pricm60               0.0002              0.0001              0.0001
seccm60               0.0004              -                   0.0018
llife                 0.0674              0.0335              -
lfert                 -0.0098             -0.0069             -
edu/gdp               -0.0547             -                   -
gcon/gdp              -0.0588             -0.0593             -
revol                 -0.0299             -                   -
revcoup               0.0215              -                   -
wardum                -0.0017             -                   -
wartime               -0.0090             -0.0231             -
lbmp                  -0.0161             -0.0142             -
tot                   0.1333              0.0846              -
lgdp60 x hyrf60       -0.0014             -                   -0.0053
lgdp60 x nof60        1.49 x 10^{-5}      -                   -
lgdp60 x prif60       -1.06 x 10^{-5}     -                   -2.66 x 10^{-6}
lgdp60 x seccf60      -0.0001             -                   -

$\lambda$                   0.0011              0.0044
$\mathcal{M}(\hat\alpha)$   22                  16
# of covariates       47                  94
# of observations     70                  70
R^2                   0.82                0.80

Note: The regularization parameter $\lambda$ is chosen by the 'leave-one-out' least squares
cross validation method. $\mathcal{M}(\hat\alpha)$ denotes the number of covariates selected by the
Lasso, and '-' indicates that the regressor is not selected. Recall that $\beta$ is the
coefficient when $Q \ge \hat\tau$ and that $\delta$ is the change of the coefficient value when $Q < \hat\tau$.
Table 4. Simulation Results with M = 50

Threshold      Estimation      Constant    Prediction Error (PE)         E[M(\hat\alpha)]   E|\hat\alpha - \alpha_0|_1   E|\hat\tau - \tau_0|_1
Parameter      Method          for \lambda   Mean     Median   SD

Jump Scale: c = 1

\tau_0 = 0.5   Least Squares   None        0.285    0.276    0.074    100.00    7.066     0.008
               Lasso           A = 2.8     0.041    0.030    0.035    12.94     0.466     0.010
                               A = 3.2     0.048    0.033    0.049    10.14     0.438     0.013
                               A = 3.6     0.067    0.037    0.086    8.44      0.457     0.024
                               A = 4.0     0.095    0.050    0.120    7.34      0.508     0.040
               Oracle 1        None        0.013    0.006    0.019    4.00      0.164     0.004
               Oracle 2        None        0.005    0.004    0.004    4.00      0.163     0.000

\tau_0 = 0.4   Least Squares   None        0.317    0.304    0.095    100.00    7.011     0.008
               Lasso           A = 2.8     0.052    0.034    0.063    13.15     0.509     0.016
                               A = 3.2     0.063    0.037    0.083    10.42     0.489     0.023
                               A = 3.6     0.090    0.045    0.121    8.70      0.535     0.042
                               A = 4.0     0.133    0.061    0.162    7.68      0.634     0.078
               Oracle 1        None        0.014    0.006    0.022    4.00      0.163     0.004
               Oracle 2        None        0.005    0.004    0.004    4.00      0.163     0.000

\tau_0 = 0.3   Least Squares   None        2.559    0.511    16.292   100.00    12.172    0.012
               Lasso           A = 2.8     0.062    0.035    0.091    13.45     0.602     0.030
                               A = 3.2     0.089    0.041    0.125    10.85     0.633     0.056
                               A = 3.6     0.127    0.054    0.159    9.33      0.743     0.099
                               A = 4.0     0.185    0.082    0.185    8.43      0.919     0.168
               Oracle 1        None        0.012    0.006    0.017    4.00      0.177     0.004
               Oracle 2        None        0.005    0.004    0.004    4.00      0.176     0.000

Jump Scale: c = 0

N/A            Least Squares   None        6.332    0.460    41.301   100.00    20.936    N/A
               Lasso           A = 2.8     0.013    0.011    0.007    9.30      0.266     N/A
                               A = 3.2     0.014    0.012    0.008    6.71      0.227     N/A
                               A = 3.6     0.015    0.014    0.009    4.95      0.211     N/A
                               A = 4.0     0.017    0.016    0.010    3.76      0.204     N/A
               Oracle 1 & 2    None        0.002    0.002    0.003    2.00      0.054     N/A

Note: M denotes the column size of $X_i$ and $\tau$ denotes the threshold parameter. Oracle 1 & 2 are estimated by
least squares when sparsity is known and when sparsity and $\tau_0$ are known, respectively. All simulations
are based on 400 replications of a sample with 200 observations.
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Figure 1. Mean Prediction Errors and Mean M(a) 





M 



= 400 
Figure 2. Mean $\ell_1$-Errors for $\alpha$ and $\tau$
[Figure not reproduced. Horizontal axis of both panels: Regularization Parameter / Oracle 1 & 2, with ticks A = 2.8, A = 3.2, A = 3.6, A = 4.0, O1, O2. Panel label: M = 400.]
Table 5. Simulation Results with $M = 50$ and $\rho = 0.3$

Estimation Method  Constant for $\lambda$  PE Mean  PE Median  PE SD   $E[M(\hat{\alpha})]$  $E|\hat{\alpha} - \alpha_0|_1$  $E|\hat{\tau} - \tau_0|_1$

Jump Scale: c = 1
Threshold parameter: $\tau_0 = 0.5$

Least Squares      None                    0.283    0.273      0.075   100.00                7.718                           0.010
Lasso              A = 2.8                 0.075    0.043      0.087   12.99                 0.650                           0.041
Lasso              A = 3.2                 0.108    0.059      0.115   10.98                 0.737                           0.071
Lasso              A = 3.6                 0.160    0.099      0.137   9.74                  0.913                           0.119
Lasso              A = 4.0                 0.208    0.181      0.143   8.72                  1.084                           0.166
Oracle 1           None                    0.013    0.006      0.017   4.00                  0.169                           0.005
Oracle 2           None                    0.005    0.004      0.004   4.00                  0.163                           0.000

Threshold parameter: [label lost in extraction]

Least Squares      None                    0.317    0.297      0.099   100.00                7.696                           0.010
Lasso              A = 2.8                 0.118    0.063      0.123   13.89                 0.855                           0.094
Lasso              A = 3.2                 0.155    0.090      0.139   11.69                 0.962                           0.138
Lasso              A = 3.6                 0.207    0.201      0.143   10.47                 1.150                           0.204
Lasso              A = 4.0                 0.258    0.301      0.138   9.64                  1.333                           0.266
Oracle 1           None                    0.013    0.007      0.016   4.00                  0.168                           0.006
Oracle 2           None                    0.005    0.004      0.004   4.00                  0.163                           0.000

Threshold parameter: [label lost in extraction]

Least Squares      None                    1.639    0.487      7.710   100.00                12.224                          0.015
Lasso              A = 2.8                 0.149    0.080      0.136   14.65                 1.135                           0.184
Lasso              A = 3.2                 0.200    0.233      0.138   12.71                 1.346                           0.272
Lasso              A = 3.6                 0.246    0.284      0.127   11.29                 1.548                           0.354
Lasso              A = 4.0                 0.277    0.306      0.116   10.02                 1.673                           0.408
Oracle 1           None                    0.013    0.006      0.017   4.00                  0.182                           0.005
Oracle 2           None                    0.005    0.004      0.004   4.00                  0.176                           0.000

Jump Scale: c = 0
Threshold parameter: N/A

Least Squares      None                    6.939    0.437      42.698  100.00                23.146                          N/A
Lasso              A = 2.8                 0.012    0.011      0.007   9.02                  0.248                           N/A
Lasso              A = 3.2                 0.013    0.011      0.008   6.54                  0.214                           N/A
Lasso              A = 3.6                 0.014    0.013      0.009   5.00                  0.196                           N/A
Lasso              A = 4.0                 0.016    0.014      0.010   3.83                  0.191                           N/A
Oracle 1 & 2       None                    0.002    0.002      0.003   2.00                  0.054                           N/A


Note: $M$ denotes the column size of $X_i$ and $\tau$ denotes the threshold parameter. Oracle 1 and Oracle 2 are estimated by
least squares when the sparsity is known and when both the sparsity and $\tau_0$ are known, respectively. All simulations
are based on 400 replications of a sample with 200 observations.
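
The two oracle benchmarks described in the note can be sketched as follows under model (1.1), $Y_i = X_i'\beta_0 + X_i'\delta_0 1\{Q_i < \tau_0\} + U_i$. This is a hedged illustration rather than the authors' implementation: the function names `design` and `oracle_fit`, the argument `support` (indices of the nonzero entries of $\alpha_0 = (\beta_0', \delta_0')'$), and the grid search over candidate thresholds for Oracle 1 are all assumptions made for the example.

```python
import numpy as np

def design(X, Q, tau):
    """Stack X and X * 1{Q < tau} column-wise, giving an n x 2M matrix."""
    return np.hstack([X, X * (Q < tau)[:, None]])

def oracle_fit(Y, X, Q, support, tau_grid):
    """Least squares restricted to a known support (hypothetical helper).

    With tau_grid = [tau0] this mimics Oracle 2 (sparsity and tau0 known).
    With a grid of candidates it mimics Oracle 1 (sparsity known only),
    assuming tau is chosen to minimize the residual sum of squares.
    """
    best = None
    for tau in tau_grid:
        Z = design(X, Q, tau)[:, support]
        coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        rss = float(np.sum((Y - Z @ coef) ** 2))
        if best is None or rss < best[0]:
            best = (rss, tau, coef)
    _, tau_hat, coef = best
    # Embed the fitted coefficients back into the full 2M-vector.
    alpha_hat = np.zeros(2 * X.shape[1])
    alpha_hat[support] = coef
    return alpha_hat, tau_hat
```

Restricting the regression to the true support is what gives both oracles an $E[M(\hat{\alpha})]$ equal to the true number of nonzero coefficients in the table, and fixing $\tau_0$ is what makes the threshold error of Oracle 2 exactly zero.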
Figure 3. Mean Prediction Errors and Mean $M(\hat{\alpha})$ when $\rho = 0.3$
Figure 4. Mean $\ell_1$-Errors for $\alpha$ and $\tau$ when $\rho = 0.3$
Figure 5. Probability of Selecting True Parameters when $\rho = 0$ and $\rho = 0.3$